Interned Strings

Tue Jan 10 18:32:01 EST 2006

Dave wrote:
> I'm trying to clarify how Python avoids byte by byte
> string comparisons most of the time. As I understand,
> dictionaries keep strings, their keys (hash values),
> and caches of their keys. 

If you are talking about dictionaries with string keys,
you also have the dictionary values, of course.

> Caching keys helps to avoid
> recalculation of a string's hash value. 

Correct (s/keys/string hash values/).

> So, when two
> strings need to be compared, only their cached keys
> are compared, which improves performance as there is
> no need for byte by byte comparison.

No. When comparing to strings s1 and s2, the following
operations occur:
1. is s1 and s2 the very same string? If yes, they
   are equal.
2. else, do they have the same size, the same first
   byte (which might be a null byte), and do they
   compare equal, byte-by-byte?
   If yes, they are equal, if not, they are not equal.
3. Is it perhaps some other compare operation (<,>

    <=, >, >=) that we want to perform? Do the
   slow algorithm.

As you can see, the string hash is never consulted when
comparing string objects. It is only consulted to
find the potential dictionary slot in the first place.

> Also, there is a global interning dictionary that
> keeps interned strings. What I don't understand is why
> strings are interned. How does it help with string
> comparisons? 

Why you look up a dictionary entry, this happens:
1. compute the key hash.
2. find the corresponding dictionary slot
   If the slot is empty, KeyError.
3. compare the slot key with the search key.
   If they are equal: value found.
   If they are different: collision, go to the
   next key.

Interned strings speed up step 1 and step 3. If
you only have interned strings throughout, you always
also have the hash value. Of course, you had to
compute the hash value when adding the string
to the interning dictionary.

The real speedup is in step 3, and based on
the assumption that collisions won't happen:
if you lookup the key (e.g. to find the value
of a global variable), you find the slot using
the computed hash. Then:
1. if the slot is empty, it's a KeyError.
2. if the slot is not empty, you first compare
   for pointer equality. As collisions are
   supposedly unlikely, this will be an equal
   string most of the time. Then, if you have
   interning, it even will be the *same* string,
   so you only need to compare pointers to find
   out they are the same string.

So assuming all strings are interned, this is how
you do the dictionary lookup.
1. fetch the hash value from the key (no need to
   compute it - it's already cached)
2. go to the slot in the dict.
3. if the slot is empty (==NULL): KeyError
4. otherwise: success.
As you can see, in that case, there is no need to
compare the string values. If there are collisions,
and if not all strings are interned, the algorithm
gets more complicated, but four items above are
assumed to be the common case.

Regards,
Martin