[pypy-svn] r17545 - pypy/dist/pypy/doc
tismer at codespeak.net
Tue Sep 13 21:38:56 CEST 2005
Author: tismer
Date: Tue Sep 13 21:38:50 2005
New Revision: 17545
Added:
pypy/dist/pypy/doc/thoughts_string_interning.txt (contents, props changed)
Log:
I spent more time on the effects of string interning than I had
planned, and the gain does not cross the threshold that would
justify some real work, so I decided to preserve some thoughts
and experience with this for later use.
Added: pypy/dist/pypy/doc/thoughts_string_interning.txt
==============================================================================
--- (empty file)
+++ pypy/dist/pypy/doc/thoughts_string_interning.txt Tue Sep 13 21:38:50 2005
@@ -0,0 +1,197 @@
+String Interning in PyPy
+===========================
+
+A few thoughts about string interning. CPython gets a remarkable
+speed-up by interning strings. All builtin string objects and all
+strings used as names are interned. The effect is that when a string
+lookup is done during instance attribute access, the dict lookup
+method always finds the string by identity, which saves a string
+comparison.
+
+Interned Strings in CPython
+---------------------------
+
+CPython keeps an internal dictionary named ``interned`` for all of these
+strings. It contains each string both as key and as value, which means
+there are, in principle, two extra references to it. Up to version 2.2,
+interned strings were considered immortal: once they entered the
+``interned`` dict, nothing could reclaim their memory.
+
+Starting with Python 2.3, interned strings became mortal by default.
+The motivation was to reduce memory usage for strings that no longer
+have an external reference. This seems to be a worthwhile enhancement.
+Interned strings that are really needed always have a real reference.
+Strings which are interned for temporary reasons get a big speed-up
+and can be freed once they are no longer in use.
+
+This was implemented by turning the ``interned`` dictionary into a weak
+dict: the refcount of each interned string is lowered by 2, so that the
+key and value references held by the dict do not keep the string alive.
+The string deallocator got extra handling to remove the string from the
+``interned`` dict when it is deallocated. This is supported by a state
+variable on string objects which tells whether the string is not
+interned, mortal or immortal.
+
+Implementation problems for PyPy
+--------------------------------
+
+- The CPython implementation makes explicit use of the refcount to handle
+  the weak-dict behavior of ``interned``. PyPy does not expose the
+  implementation of object aliveness, so special handling would be needed
+  to simulate mortal behavior. A possible but expensive solution would be
+  to use a real weak dictionary. Another way is to add a special interface
+  to the backend that either allows the two extra references to be reset,
+  or lets the Boehm collector exclude the ``interned`` dict from reference
+  tracking.
+
+- PyPy implements fairly complete internal (low-level) strings, as opposed
+  to CPython, which always uses its "applevel" strings. It also supports
+  low-level dictionaries. This adds some complication to the issue of
+  interning. Additionally, the interpreter currently handles attribute
+  access by calling wrap(str) on the low-level attribute string when
+  executing frames. This implies that we primarily have to intern
+  low-level strings and cache the created string objects on top of them.
+  A possible implementation would use a dict with ll string keys and the
+  string objects as values. In order to save the extra dict lookup, we
+  could also consider caching the string object directly in a field of the
+  rstr, which of course adds some extra cost. Alternatively, a fast
+  id-indexed extra dictionary can provide the mapping from rstr to interned
+  string object. For efficiency reasons, however, it is necessary in any
+  case to put an extra flag about interning on the strings. Using the
+  string object itself as that flag might be acceptable. A dummy object
+  can be used if the interned rstr is not exposed as an interned string
+  object.
+
+A prototype brute-force patch
+--------------------------------
+
+In order to get some idea of how effective string interning is at the
+moment, I implemented a quite crude version of interning. I patched
+space.wrap to call this intern_string instead of W_StringObject::
+
+    def intern_string(space, str):
+        if we_are_translated():
+            _intern_ids = W_StringObject._intern_ids
+            str_id = id(str)
+            w_ret = _intern_ids.get(str_id, None)
+            if w_ret is not None:
+                return w_ret
+            _intern = W_StringObject._intern
+            if str not in _intern:
+                _intern[str] = W_StringObject(space, str)
+            W_StringObject._intern_keep[str_id] = str
+            _intern_ids[str_id] = w_ret = _intern[str]
+            return w_ret
+        else:
+            return W_StringObject(space, str)
+
+This is not a general solution at all, since it a) does not provide
+interning of rstr and b) interns every app-level string. The
+implementation is also far less efficient than it could be,
+because it uses an extra dict _intern_ids which maps the
+id of the rstr to the string object, and a dict _intern_keep which
+keeps the rstrs alive so that these ids stay valid.
+
+With just a single _intern dict from rstr to string object, the
+overall performance degraded slightly instead of improving.
+The triple-dict patch accelerates richards by about 12 percent.
+Since it still has the overhead of handling the extra dicts,
+I guess we can expect twice the acceleration if we add proper
+interning support.
+
+The resulting estimated 24 % acceleration is still not enough
+to justify an implementation right now.
+
+Here are the results of the richards benchmark::
+
+ D:\pypy\dist\pypy\translator\goal>pypy-c-17516.exe -c "from richards import *;Richards.iterations=1;main()"
+ debug: entry point starting
+ debug: argv -> pypy-c-17516.exe
+ debug: argv -> -c
+ debug: argv -> from richards import *;Richards.iterations=1;main()
+ Richards benchmark (Python) starting... [<function entry_point at 0xeae060>]
+ finished.
+ Total time for 1 iterations: 38 secs
+ Average time for iterations: 38885 ms
+
+ D:\pypy\dist\pypy\translator\goal>pypy-c.exe -c "from richards import *;Richards.iterations=1;main()"
+ debug: entry point starting
+ debug: argv -> pypy-c.exe
+ debug: argv -> -c
+ debug: argv -> from richards import *;Richards.iterations=1;main()
+ Richards benchmark (Python) starting... [<function entry_point at 0xead810>]
+ finished.
+ Total time for 1 iterations: 34 secs
+ Average time for iterations: 34388 ms
+
+ D:\pypy\dist\pypy\translator\goal>
+
+
+This was just an exercise to get an idea. It is certainly not meant to be
+checked in. Instead, I'm attaching the simple patch here for reference.
+::
+
+ Index: objspace/std/objspace.py
+ ===================================================================
+ --- objspace/std/objspace.py (revision 17526)
+ +++ objspace/std/objspace.py (working copy)
+ @@ -243,6 +243,9 @@
+                  return self.newbool(x)
+              return W_IntObject(self, x)
+          if isinstance(x, str):
+ +            # XXX quick speed testing hack
+ +            from pypy.objspace.std.stringobject import intern_string
+ +            return intern_string(self, x)
+              return W_StringObject(self, x)
+          if isinstance(x, unicode):
+              return W_UnicodeObject(self, [unichr(ord(u)) for u in x]) # xxx
+ Index: objspace/std/stringobject.py
+ ===================================================================
+ --- objspace/std/stringobject.py (revision 17526)
+ +++ objspace/std/stringobject.py (working copy)
+ @@ -18,6 +18,10 @@
+  class W_StringObject(W_Object):
+      from pypy.objspace.std.stringtype import str_typedef as typedef
+
+ +    _intern_ids = {}
+ +    _intern_keep = {}
+ +    _intern = {}
+ +
+      def __init__(w_self, space, str):
+          W_Object.__init__(w_self, space)
+          w_self._value = str
+ @@ -32,6 +36,21 @@
+
+ registerimplementation(W_StringObject)
+
+ +def intern_string(space, str):
+ +    if we_are_translated():
+ +        _intern_ids = W_StringObject._intern_ids
+ +        str_id = id(str)
+ +        w_ret = _intern_ids.get(str_id, None)
+ +        if w_ret is not None:
+ +            return w_ret
+ +        _intern = W_StringObject._intern
+ +        if str not in _intern:
+ +            _intern[str] = W_StringObject(space, str)
+ +        W_StringObject._intern_keep[str_id] = str
+ +        _intern_ids[str_id] = w_ret = _intern[str]
+ +        return w_ret
+ +    else:
+ +        return W_StringObject(space, str)
+
+  def _isspace(ch):
+      return ord(ch) in (9, 10, 11, 12, 13, 32)
+ Index: objspace/std/stringtype.py
+ ===================================================================
+ --- objspace/std/stringtype.py (revision 17526)
+ +++ objspace/std/stringtype.py (working copy)
+ @@ -47,6 +47,10 @@
+      if space.is_true(space.is_(w_stringtype, space.w_str)):
+          return w_obj # XXX might be reworked when space.str() typechecks
+      value = space.str_w(w_obj)
+ +    # XXX quick hack to check interning effect
+ +    w_obj = W_StringObject._intern.get(value, None)
+ +    if w_obj is not None:
+ +        return w_obj
+      w_obj = space.allocate_instance(W_StringObject, w_stringtype)
+      W_StringObject.__init__(w_obj, space, value)
+      return w_obj
+
+ciao - chris
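
The "real weak dictionary" option mentioned under "Implementation problems
for PyPy" can be sketched in plain Python (not RPython). This is only an
illustration under stated assumptions: ``WString`` and ``intern_wrapped``
are hypothetical names that do not exist in PyPy, and the sketch says
nothing about what such a dict would cost after translation.

```python
import gc
import weakref


class WString(object):
    """Hypothetical stand-in for an app-level string object wrapping a low-level string."""

    def __init__(self, value):
        self.value = value


# Values are held weakly, so an entry disappears once nothing else
# references the wrapper. This mimics "mortal" interned strings.
_interned = weakref.WeakValueDictionary()


def intern_wrapped(value):
    """Return the shared wrapper for value, creating it on first use (hypothetical helper)."""
    w_str = _interned.get(value)
    if w_str is None:
        w_str = WString(value)
        _interned[value] = w_str
    return w_str


a = intern_wrapped("spam")
b = intern_wrapped("sp" + "am")
same_object = a is b  # equal strings map to one wrapper, so identity comparison works

del a, b
gc.collect()  # needed on collectors without refcounting
entry_gone = "spam" not in _interned  # the unreferenced wrapper has been dropped
```

The dict is keyed by the string value and holds only the wrapper weakly, so
no refcount adjustment is needed; the price is the weak-reference machinery
itself, which is the expense the text above warns about.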