[Python-Dev] String interning

Oren Tirosh oren-py-d@hishome.net
Mon, 1 Jul 2002 09:31:20 +0300


I was looking into the string interning code to see how it works. Here are
my observations:

A Python string can be in one of three states:

1. Not interned: s->ob_sinterned == NULL
2. Directly-interned: s->ob_sinterned == s
3. Indirectly interned. ob_sinterned points to another string:
     s->ob_sinterned != s && s->ob_sinterned != NULL

Indirectly interned strings are quite rare.  Creating one requires that a
string already exist in the interned dictionary and that an equal string 
with multiple references be interned.  The reference used for internining 
is replaced with the previously interned string.  The other references 
to the same string will become indirectly interned.

References to the ob_sinterned field are found in in stringobject.c and 
dictobject.c and, inexplicably, in Mac/Python/macimport.c

In stringobject.c most references to ob_sinterned are to initialize it. The
only place that uses it is string_hash:  if ob_sinterned is not NULL it uses 
the hash of the string it points to instead of the current string object. 
If the string is directly interned this is just a longer way of doing the
same thing.  If the string is indirectly interned this is merely redundant 
because the hash of the two strings should be equal (I hope so!). The only 
thing this test could have saved is recalculating the hash from the string 
if the cached hash is zero. This doesn't happen because if the string
is indirectly interned it has been used as a key during interning which
initializes its cached hash.

In dictobject.c the only reference to ob_sinterned is in PyDict_SetItem: if
the string is interned it uses the ob_sinterned pointer as the key instead
of the argument. This could only make a difference if the string is 
indirectly interned. It turns out that this never happens. I couldn't find
one occurence of SetItem with an indirectly interned string as key in the
regression tests or any other Python code I have tested.  Even if this did
happen it wouldn't cause any problems to ignore this case - the lookup 
would still function correctly.

Summary: As far as I can tell, indirectly interned strings are redundant. 
Without them the ob_sinterned field is effectively a boolean flag. The size 
of all string objects can be reduced by 3 bytes.

Can anyone explain why interning is implemented the way it is?  Can anyone
explain why Mac/Python/macimport.c is messing with ob_sinterned?

	Oren