[Python-Dev] Rethinking intern() and its data structure

Jake McGuire jake at youtube.com
Fri Apr 10 02:37:56 CEST 2009


On Apr 9, 2009, at 12:06 PM, Martin v. Löwis wrote:
> Now that you brought up specific numbers, I tried to verify them
> and found them correct (although a bit unfortunate); please see my
> test script below. Up to 21800 interned strings, the dict takes (only)
> 384kiB. It then grows, requiring 1536kiB. Whether or not having 22k
> interned strings is "typical", I still don't know.
>
> Wrt. your proposed change, I would be worried about maintainability,
> in particular if it would copy parts of the set implementation.
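
Those figures line up with how the 2.x dict resizes, assuming a 32-bit
build with 12-byte dict entries (hash, key pointer, value pointer) and
the usual resize-at-two-thirds-full, grow-by-4x behavior for a dict of
this size.  A rough sketch of the arithmetic (not Martin's script):

    slots = 32768
    print slots * 12 / 1024    # 384 KiB table before the resize
    print slots * 2 / 3        # ~21845 entries is the resize trigger
    slots = 131072             # next power of two >= 4 * 21846
    print slots * 12 / 1024    # 1536 KiB table afterwards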


I connected to one of our processes at random; it has been running
for a typical amount of time and is currently at ~300MB RSS.

(gdb) p *(PyDictObject*)interned
$2 = {ob_refcnt = 1,
       ob_type = 0x8121240,
       ma_fill = 97239,
       ma_used = 95959,
       ma_mask = 262143,
       ma_table = 0xa493c008,
       ....}
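
That ma_mask means 262144 slots; assuming a 32-bit build where each
dict entry is a 12-byte (hash, key, value) triple, the table alone
works out to about 3MB:

    slots = 262143 + 1        # ma_mask + 1
    print slots * 12          # 3145728 bytes, about 3MB for the table
    print 95959.0 / slots     # ~0.37 load factor after the last resize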

Going from 3MB to 2.25MB isn't much, but it's not nothing, either.

I'd be skeptical of cache-performance arguments: the strings used in
any particular bit of code should be spread pretty much evenly
throughout the hash table, and 3MB is solidly bigger than any L2 cache
I know of.  You should be able to get meaningful numbers out of a C
profiler, but I'd be surprised to see the act of interning take a
noticeable amount of time.
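
For a quick sanity check before reaching for the C profiler, timing the
path intern() takes for strings that are already in the table (a dict
lookup per call) gives a ballpark; the workload size and names here are
made up for illustration, on 2.x where intern() is a builtin:

    import timeit

    setup = (
        "words = ['sym_%d' % i for i in range(100000)]\n"
        "for w in words:\n"
        "    intern(w)\n"
        "fresh = ['sym_%d' % i for i in range(100000)]\n"
    )
    # Interning strings equal to ones already interned exercises the
    # hash-and-lookup path, which is the common case for attribute names.
    t = timeit.timeit("for w in fresh: intern(w)", setup=setup, number=1)
    print "re-interning 100k known strings: %.1f ms" % (t * 1000.0)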

-jake

