[Python-ideas] Accelerated attr lookups

Eyal Lotem eyal.lotem at gmail.com
Tue Jun 19 17:13:10 CEST 2007


Hi, I have attached a patch at:
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=1739789&group_id=5470

A common optimization tip for Python code is to use locals rather than
globals. This converts dictionary lookups of (interned) strings into
simple indexing into the frame's array of local variables. I have
created a patch that achieves this speed benefit "automatically" for
all globals and builtins, by adding a feature to dictobjects.
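
For context, here is a minimal sketch of the well-known manual
workaround that the patch aims to make unnecessary (the function names
and workload here are purely illustrative, not from the patch):

import timeit

def use_global():
    total = 0
    for _ in range(1000):
        total += len("abc")   # LOAD_GLOBAL: dict lookup in globals, then builtins
    return total

def use_local(len=len):       # the classic trick: bind the builtin to a local name
    total = 0
    for _ in range(1000):
        total += len("abc")   # LOAD_FAST: a plain index into the fast-locals array
    return total

print(timeit.Timer("use_global()", "from __main__ import use_global").timeit(1000))
print(timeit.Timer("use_local()", "from __main__ import use_local").timeit(1000))

The second version is typically measurably faster; the point of the
patch is to get that benefit without contorting the code.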

Additionally, this patch lays down the infrastructure needed to allow
virtually all attribute accesses to be accelerated in the same way
(with some extra work, of course).

I have suggested this before, but I got the impression that the spirit
of the replies was "talk is cheap, show us the code/benchmarks". So I
wrote some code.

Getting the changes to work was not easy, and required learning about
the nuances of dictobjects, their re-entrancy issues, etc. These
changes do slow down dictobjects, but it seems that this slowdown is
more than offset by the speed increase in builtin/global access.

Benchmarks:

A set of benchmarks that repeatedly perform:
A. Global reads
B. Global writes
C. Builtin reads
with little overhead (just repeatedly issuing global/builtin access
bytecodes, many times per loop iteration, to minimize the loop
overhead), yields a ~30% decrease in run time (a ~42% speed increase).
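
Roughly, each of these benchmarks has the following shape (the names
and counts here are illustrative only, not the actual benchmark
scripts):

import timeit

g = 0   # module-level global used by the benchmarks

def global_reads():
    # A. many LOAD_GLOBAL opcodes per loop iteration, to drown out loop overhead
    for _ in range(1000):
        g; g; g; g; g; g; g; g; g; g

def global_writes():
    # B. many STORE_GLOBAL opcodes per loop iteration
    global g
    for _ in range(1000):
        g = 1; g = 2; g = 3; g = 4; g = 5

def builtin_reads():
    # C. many builtin lookups (miss in globals, hit in builtins) per iteration
    for _ in range(1000):
        len; len; len; len; len; len; len; len; len; len

for name in ("global_reads", "global_writes", "builtin_reads"):
    t = timeit.Timer(name + "()", "from __main__ import " + name).timeit(1000)
    print(name + ": " + str(t))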


The regression tests take ~62 seconds (of user+sys time) with the Python 2.6 trunk.
The regression tests take ~65 seconds (of user+sys time) with the patch.
So the regression tests are about 4.5% slower.
(Though the regression tests are probably not a good benchmark: they
spread their running time over a lot more code than typical programs,
spending more time instantiating function objects and less time
executing them.)

pystone seems to be improved by about 5%.

My conclusions:
The LOAD_GLOBAL/STORE_GLOBAL opcodes are considerably faster. Dict
accesses, or perhaps the general extra activity around them, are only
insignificantly slower, or at least the cost cancels out against the
speed benefits in the regression tests.

The next step I am going to try is to replace the PyObject_GetAttr
call with code that:
* Calls PyObject_GetAttr only if GenericGetAttr is not the object's
handler, so that types which customize attribute access keep their
behaviour.
* Otherwise, remembers, for each attribute-accessing opcode, the last
type from which the attribute was accessed. A single pointer
comparison can then check whether the attribute access is on the same
type. If it is, the opcode can use a stored exported key from the type
dictionary [or from an mro cache dictionary for that type, if that is
added], rather than performing a dict lookup. If this yields the same
speed benefit, it could make attribute access opcodes up to 42% faster
as well, when used repeatedly on the same types (which is probably the
common case, particularly in inner loops); a rough Python model of
this scheme is sketched below.
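
A minimal pure-Python sketch of that caching scheme, with hypothetical
names (SiteCache, load_attr); the real thing would of course live in C
inside ceval.c, compare type pointers directly, and would also need
invalidation when the type's dict changes, which this model ignores:

class SiteCache(object):
    """Models the cache attached to a single attribute-access opcode."""

    def __init__(self, attr_name):
        self.attr_name = attr_name
        self.cached_type = None    # last type seen at this access site
        self.cached_value = None   # attribute found in that type (the "exported key")

    def load_attr(self, obj):
        tp = type(obj)
        if tp is self.cached_type:
            # Fast path: one identity comparison, no dict lookup at all.
            return self.cached_value
        # Slow path: normal lookup through the type, then refill the cache.
        value = getattr(tp, self.attr_name)
        self.cached_type = tp
        self.cached_value = value
        return value

class Foo(object):
    def method(self):
        return 42

site = SiteCache("method")
f = Foo()
method = site.load_attr(f)    # slow path: fills the cache
method = site.load_attr(f)    # fast path: only a type identity check
print(method(f))              # -> 42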

Combined with __slots__, this would also eliminate dict lookups for
most instance-side attribute accesses.
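
(For reference, a standard __slots__ example; this is existing Python
behaviour, not part of the patch:)

class Point(object):
    # __slots__ replaces the per-instance __dict__ with fixed storage slots,
    # so reading p.x does not involve an instance dict at all.
    __slots__ = ('x', 'y')

    def __init__(self, x, y):
        self.x = x
        self.y = y

p = Point(1, 2)
print(p.x)       # slot access, no instance __dict__
# p.z = 3        # would raise AttributeError: no such slot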

P.S: I discovered a lot of code duplication (and "went along" with it,
duplicating my own code in the same spirit), but I was wondering
whether a patch would be accepted that used C's preprocessor heavily to
prevent code duplication in CPython's code, trusting the "inline"
keyword to keep thousands of lines out of a single function (ceval.c's
opcode switch).


