Confusion about dictionaries - keys use value or identity?

Sun Jul 8 19:14:56 EDT 2001

[Steve Holden]
> We are talking about has_key here, right?

RIght.

> I was simply trying (perhaps rather too hurriedly, as it turned out)
> the fact that strings if unequal lengths could not be equal, and only
> equality is required when indexing a dictionary.

Good point, and I hadn't fully realized the context.  Things are still
subtler than you may think, though.

> Without looking at the code I'm presuming that actually some fairly
> crafty hashing algorithm is used to test for the presence of a
> particular key.

Yes.

> But when more that one entry hashes to the same bucket,

That doesn't matter:  the full hash codes are compared first no matter how
few buckets, and it's extremely rare in practice for the full hash codes of
unequal strings to be equal (the odds are about 1 in 2**32, confirmed by
many test runs on both "typical" and contrived strings).  That is, it's not
ending up in the same bucket that matters, it's having identical hash codes:
no comparison is done unless the full hash codes are equal (BTW, string
objects cache their hash codes, so a given string object never *computes*
its hash code more than once no matter how often you-- or the
internals --apply hash() to it).

> I hope the lengths are checked before character-by-character comparison
> goes ahead ... you know the details, I don't, so I'm not going to press
> this issue any further.

The lengths aren't checked first in 2.1 or earlier, in part because before
rich comparisons were added, it was impossible for the string-compare
routine to *know* that == vs != was the only interesting outcome.  There is
code in current CVS to check the lengths first, but because string hash code
equality is so rare I'm dubious that bothering with the length-check is
actually a win (string compare from dicts will almost never be called unless
the strings are in fact equal, so the "lengths equal?" check almost always
says "yes" in this context).

> And no, I haven't forgotten that strings are only a subset of the
> possible keys.

Python has forgotten it, though <wink>.  That is, under the covers, dicts
with nothing but string keys use a different lookup routine than dicts with
at least one non-string key.  Dicts with string keys are optimized in many
subtle ways.