[Python-Dev] Rethinking intern() and its data structure
John Arbash Meinel
john at arbash-meinel.com
Thu Apr 9 17:02:14 CEST 2009
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
I've been doing some memory profiling of my application, and I've found
some interesting results with how intern() works. I was pretty surprised
to see that the "interned" dict was actually consuming a significant
amount of total memory.
To give the specific values, after doing:
bzr branch A B
of a small project, the total memory consumption is ~21MB
Of that, the largest single object is the 'interned' dict, at 1.57MB,
which contains 22k strings. One interesting bit, the size of it + the
referenced strings is only 2.4MB. So the "interned" dict *by itself* is
2/3rds the size of the dict + strings it contains.
It also means that the average size of a referenced string is 37.4
bytes. A 'str' has 24 bytes of overhead, so the average string is 13.5
characters long. So to save references to 13.5*22k ~ 300kB of character
data, we are paying 2.4MB, or about 8:1 overhead.
When I looked at the actual references from interned, I saw mostly
variable names. Considering that every variable goes through the python
intern dict. And when you look at the intern function, it doesn't use
setdefault logic, it actually does a get() followed by a set(), which
means the cost of interning is 1-2 lookups depending on likelyhood, etc.
(I saw a whole lot of strings as the error codes in win32all /
winerror.py, and windows error codes tend to be longer-than-average
variable length.)
Anyway, I the internals of intern() could be done a bit better. Here are
some concrete things:
a) Don't keep a double reference to both key and value to the same
object (1 pointer per entry), this could be as simple as using a
Set() instead of a dict()
b) Don't cache the hash key in the set, as strings already cache them.
(1 long per entry). This is a big win for space, but would need to
be balanced against lookup and collision resolving speed.
My guess is that reducing the size of the set will actually improve
speed more, because more items can fit in cache. It depends on how
many times you need to resolve a collision. If the string hash is
sufficiently spread out, and the load factor is reasonable, then
likely when you actually find an item in the set, it will be the
item you want, and you'll need to bring the string object into
cache anyway, so that you can do a string comparison (rather than
just a hash comparison.)
c) Use the existing lookup function one time. (PySet->lookup())
Sets already have a "lookup" which is optimized for strings, and
returns a pointer to where the object would go if it exists. Which
means the intern() function can do a single lookup resolving any
collisions, and return the object or insert without doing a second
lookup.
d) Having a special structure might also allow for separate optimizing
of things like 'default size', 'grow rate', 'load factor', etc. A
lot of this could be tuned specifically knowing that we really only
have 1 of these objects, and it is going to be pointing at a lot of
strings that are < 50 bytes long.
If hashes of variable name strings are well distributed, we could
probably get away with a load factor of 2. If we know we are likely
to have lots and lots that never go away (you rarely *unload*
modules, and all variable names are in the intern dict), that would
suggest having a large initial size, and probably a wide growth
factor to avoid spending a lot of time resizing the set.
e) How tuned is String.hash() for the fact that most of these strings
are going to be ascii text? (I know that python wants to support
non-ascii variable names, but I still think there is going to be an
overwhelming bias towards characters in the range 65-122 ('A'-'z').
Also note that the performance of the "interned" dict gets even worse on
64-bit platforms. Where the size of a 'dictentry' doubles, but the
average length of a variable name wouldn't change.
Anyway, I would be happy to implement something along the lines of a
"StringSet", or maybe the "InternSet", etc. I just wanted to check if
people would be interested or not.
John
=:->
PS> I'm not yet subscribed to python-dev, so if you could make sure to
CC me in replies, I would appreciate it.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iEYEARECAAYFAkneDfYACgkQJdeBCYSNAAPMywCfQVWOg51dtIkWT/jttVTARV0g
WJ4An1w7ypB+akHT5hiSwRKoUhH7ez4j
=9TTp
-----END PGP SIGNATURE-----
More information about the Python-Dev
mailing list