[Python-Dev] Rethinking intern() and its data structure

John Arbash Meinel john at arbash-meinel.com
Thu Apr 9 17:02:14 CEST 2009



I've been doing some memory profiling of my application, and I've found
some interesting results with how intern() works. I was pretty surprised
to see that the "interned" dict was actually consuming a significant
amount of total memory.
To give the specific values, after doing:
  bzr branch A B
of a small project, the total memory consumption is ~21MB

Of that, the largest single object is the 'interned' dict, at 1.57MB,
which contains 22k strings. One interesting bit: the size of the dict
plus the referenced strings is only 2.4MB, so the "interned" dict *by
itself* is 2/3rds the size of the dict + strings it contains.

It also means that the average size of a referenced string is 37.4
bytes. A 'str' has 24 bytes of overhead, so the average string is 13.5
characters long. So to save references to 13.5*22k ~ 300kB of character
data, we are paying 2.4MB, or about 8:1 overhead.
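
To make that arithmetic easy to check, here it is written out; the entry
count, dict size, and per-string overhead are the profiled numbers quoted
above (on a 32-bit build), not something this snippet measures itself:

  entries = 22000        # strings referenced by 'interned'
  dict_bytes = 1.57e6    # the dict's own table
  total_bytes = 2.4e6    # dict plus the string objects it references

  avg_str = (total_bytes - dict_bytes) / entries  # ~37 bytes per string
  avg_chars = avg_str - 24                        # ~13.5 characters of text
  char_data = avg_chars * entries                 # ~300kB of character data
  print(total_bytes / char_data)                  # overhead ratio, ~8:1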

When I looked at the actual references from interned, I saw mostly
variable names, which makes sense, since every variable name goes
through the intern dict. And when you look at the intern function, it
doesn't use setdefault logic; it actually does a get() followed by a
set(), so the cost of interning is 1-2 lookups depending on whether the
string is already present.
(I also saw a whole lot of strings that are the error-code names from
win32all / winerror.py, and Windows error-code names tend to be longer
than the average variable name.)
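
In Python terms the difference looks roughly like this (the real code is
C in the string implementation, so this is only a sketch of the lookup
pattern, not the actual function):

  interned = {}

  def intern_two_lookups(s):
      t = interned.get(s)    # first hash/probe pass
      if t is not None:
          return t
      interned[s] = s        # second hash/probe pass to insert
      return s

  def intern_one_lookup(s):
      # a setdefault-style approach resolves the probe sequence only once
      return interned.setdefault(s, s)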

Anyway, I think the internals of intern() could be done a bit better. Here
are some concrete things:

  a) Don't keep a double reference to both key and value pointing at the
     same object (saves 1 pointer per entry); this could be as simple as
     using a set() instead of a dict(). (See the measurement sketch after
     this list.)

  b) Don't cache the hash value in the set entry, as string objects
     already cache their own hash (saves 1 long per entry). This is a big
     win for space, but would need to be balanced against lookup and
     collision-resolution speed.

     My guess is that reducing the size of the set will actually improve
     speed more, because more items can fit in cache. It depends on how
     many times you need to resolve a collision. If the string hash is
     sufficiently spread out, and the load factor is reasonable, then
     likely when you actually find an item in the set, it will be the
     item you want, and you'll need to bring the string object into
     cache anyway, so that you can do a string comparison (rather than
     just a hash comparison.)

  c) Use the existing lookup function one time. (PySet->lookup())
     Sets already have a "lookup" which is optimized for strings and
     returns a pointer to the slot where the object is (or would go),
     which means the intern() function can resolve any collisions with a
     single lookup and either return the existing object or insert the
     new one without doing a second lookup.

  d) Having a special structure might also allow for separate optimizing
     of things like 'default size', 'grow rate', 'load factor', etc. A
     lot of this could be tuned specifically knowing that we really only
     have 1 of these objects, and it is going to be pointing at a lot of
     strings that are < 50 bytes long.

     If hashes of variable name strings are well distributed, we could
     probably get away with a load factor of 2. If we know we are likely
     to have lots and lots that never go away (you rarely *unload*
     modules, and all variable names are in the intern dict), that would
     suggest having a large initial size, and probably a wide growth
     factor to avoid spending a lot of time resizing the set.

  e) How tuned is String.hash() for the fact that most of these strings
     are going to be ascii text? (I know that python wants to support
     non-ascii variable names, but I still think there is going to be an
     overwhelming bias towards characters in the range 65-122 ('A'-'z').)
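
To put rough numbers on (a), here is a quick Python-level sketch that
compares the container overhead of a key->value dict against a plain set
holding the same strings. It is only an illustration of the sizes
involved, not the proposed C change, and the exact figures depend on the
build and Python version:

  import sys

  # 22k short ascii strings, comparable in length to variable names.
  names = ["var_%d_name" % i for i in range(22000)]

  as_dict = dict((s, s) for s in names)  # how interning stores them today
  as_set = set(names)                    # one reference per entry instead

  print("dict table:", sys.getsizeof(as_dict))
  print("set table: ", sys.getsizeof(as_set))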

Also note that the overhead of the "interned" dict gets even worse on
64-bit platforms, where the size of a 'dictentry' doubles but the
average length of a variable name doesn't change.
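
Assuming the usual entry layouts (hash + key + value pointers for a
dict, hash + key for a set), the per-entry table cost works out roughly
like this, ignoring padding and the empty slots implied by the load
factor:

  ptr32, ptr64 = 4, 8
  dictentry_32 = 3 * ptr32   # hash + key + value = 12 bytes
  dictentry_64 = 3 * ptr64   # hash + key + value = 24 bytes
  setentry_64  = 2 * ptr64   # hash + key         = 16 bytes
  print(dictentry_32, dictentry_64, setentry_64)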

Anyway, I would be happy to implement something along the lines of a
"StringSet", or maybe the "InternSet", etc. I just wanted to check if
people would be interested or not.

John
=:->

PS> I'm not yet subscribed to python-dev, so if you could make sure to
CC me in replies, I would appreciate it.


