[Python-Dev] Rethinking intern() and its data structure

Thu Apr 9 19:35:05 CEST 2009

Christian Heimes wrote:
> John Arbash Meinel wrote:
>> When I looked at the actual references from interned, I saw mostly
>> variable names. Considering that every variable goes through the python
>> intern dict. And when you look at the intern function, it doesn't use
>> setdefault logic, it actually does a get() followed by a set(), which
>> means the cost of interning is 1-2 lookups depending on likelyhood, etc.
>> (I saw a whole lot of strings as the error codes in win32all /
>> winerror.py, and windows error codes tend to be longer-than-average
>> variable length.)
> 
> I've read your posting twice but I'm still not sure if you are aware of
> the most important feature of interned strings. In the first place
> interning not about saving some bytes of memory but a speed
> optimization. Interned strings can be compared with a simple and fast
> pointer comparison. With interend strings you can simple write:
> 
>     char *a, *b;
>     if (a == b) {
>         ...
>     }
> 
> Instead of:
> 
>     char *a, *b;
>     if (strcmp(a, b) == 0) {
>         ...
>     }
> 
> A compiler can optimize the pointer comparison much better than a
> function call.
> 

Certainly. But there is a cost associated with calling intern() in the
first place. You created a string, and you are now trying to de-dup it.
That cost is both in the memory to track all strings interned so far,
and the cost to do a dict lookup. And the way intern is currently
written, there is a third cost when the item doesn't exist yet, which is
another lookup to insert the object.

I'll also note that increasing memory does have a semi-direct effect on
performance, because more memory requires more time to bring memory back
and forth from main memory to CPU caches.

...

> I agree that a dict is not the most memory efficient data structure for
> interned strings. However dicts are extremely well tested and highly
> optimized. Any specialized data structure needs to be desinged and
> tested very carefully. If you happen to break the interning system it's
> going to lead to rather nasty and hard to debug problems.

Sure. My plan was to basically take the existing Set/Dict design, and
just tweak it slightly for the expected operations of "interned".

> 
>>   e) How tuned is String.hash() for the fact that most of these strings
>>      are going to be ascii text? (I know that python wants to support
>>      non-ascii variable names, but I still think there is going to be an
>>      overwhelming bias towards characters in the range 65-122 ('A'-'z').
> 
> Python 3.0 uses unicode for all names. You have to design something that
> can be adopted to unicode, too. By the way do you know that dicts have
> an optimized lookup function for strings? It's called lookdict_unicode /
>  lookdict_string.

Sure, but so does PySet. I'm not sure about lookset_unicode, but I would
guess that exists or should exist for py3k.

> 
>> Also note that the performance of the "interned" dict gets even worse on
>> 64-bit platforms. Where the size of a 'dictentry' doubles, but the
>> average length of a variable name wouldn't change.
>>
>> Anyway, I would be happy to implement something along the lines of a
>> "StringSet", or maybe the "InternSet", etc. I just wanted to check if
>> people would be interested or not.
> 
> Since interning is mostly used in the core and extension modules you
> might want to experiment with a different growth rate. The interning
> data structure could start with a larger value and have a slower, non
> progressive data growth rate.
> 
> Christian

I'll also mention that there are other uses for intern() where it is
uniquely suitable. Namely, if you are parsing lots of text with
redundant strings, it is a way to decrease total memory consumption.
(And potentially speed up future comparisons, etc.)

The main reason why intern() is useful for this is because it doesn't
make strings immortal, as would happen if you used some other structure.
Because strings know about the "interned" object.

The options for a 3rd-party structure fall down into something like:

1) A cache that makes the strings immortal. (IIRC this is what older
versions of Python did.)

2) A cache that is periodically walked to see if any of the objects are
no longer externally referenced. The main problem here is that walking
is O(all-objects), whereas doing the checking at refcount=0 time means
you only check objects when you think the last reference has gone away.

3) Hijacking PyStringType->dealloc, so that when the refcount goes to 0
and Python want's to destroy the string, you then trigger your own cache
to look and see if it should remove the object.

Even further, you either have to check on every string dealloc, or
re-use PyStringObject->ob_sstate to track that you have placed this
string into your custom structure. Which would preclude ever calling
intern() on this string, because intern() doesn't just check a couple
bits, it looks at the entire ob_sstate value.

I think you could make it work, such that if your custom cache had set
some values, then intern() would just return without evaluating, and
during dealloc you could make sure that you set ob_sstate back to 0
before letting the rest of the python machinery dealloc the string.

John
=:->