[Python-Dev] patch: speed up name access by up to 80%

Mon, 11 Feb 2002 22:09:54 +0200

Problem: Python name lookup in dictionaries is relatively slow.

Possible solutions:

1. Find ways to bypass dictionary lookup.
2. Optimize the hell out of dictionary lookup.

Examples of approach #1 are the existing fastlocals mechanism, PEP 266,
PEP 267 and the recent proposal by GvR for speeding up instance attribute
access. These proposals all face major difficulties resulting from the
dynamic nature of Python's namespace and require tricky techniques of code
analysis code to check for assignments that may change the visible namespace.

I have chosen to try #2. The biggest difficulty with this approach is that
dictionaries are already very well optimized :-) But I have found that
Python dictionaries are mostly optimized for general-purpose use, not for
use as namespace for running code. There are some characteristics unique to 
namespace access which have not been used for optimization:

* Lookup keys are always interned strings.
* The local/global/builtin fallback means that most accesses fail.

The patch adds the function PyDict_GetItem_Fast to dictobject. This
function is equivalent to PyDict_GetItem but is much faster for lookup
using interned string keys and for lookups with a negative result.
LOAD_NAME and LOAD_GLOBAL have been converted to use this function.

Here are the timings for 200,000,000 accesses to fastlocals, locals,
globals and builtins for Python 2.2 with and without the fastnames patch:

                2.2       2.2+fastnames
---------------------------------------
builtin.py      68.540s   37.900s
global.py       54.570s   34.020s
local.py        41.210s   32.780s
fastlocal.py    24.530s   24.540s

Fastlocals are still significantly faster than locals, but not by such a 
wide margin. You can notice that the speed differences between locals, 
globals and builtins are almost gone.

The machine is a Pentium III at 866MHz running linux. Both interpreters 
were compiled with "./configure ; make". The test code is at the bottom of
this message. You will not see any effect on the results of pybench
because it uses only fastlocals inside the test loops. Real code that uses
a lot of globals and builtins should see a significant improvement.

Get the patch:

http://www.tothink.com/python/fastnames/fastnames.patch

The optimization techniques used:

* Inline code
PyDict_GetItem_Fast is actually an inline macro. If the item is stored at
the first hash location it will be returned without any function calls. It
also requires that the key used for the entry is the same interned string
as the key. This fast inline version is possible because the first
argument is known to be a valid dictionary and the second argument is
known to be a valid interned string with a valid cached hash.

* Key comparison by pointer
If the inline macro fails PyDict_GetItem_Fast2 is called. This function
searches for an entry with a key identical to the requested key. This
search is faster than lookdict or lookdict_string because there are no 
expensive calls to external compare-by-value functions. Another small
speedup is gained by not checking for free slots since this function is
never used for setting items. If this search fails, the dictionary's 
ma_lookup function is called.

* Interning of entry keys
One reason the quick search could fail is because the entry was set using
direct access to __dict__ instead of standard name assignment and
therefore the entry key is not interned. In this case the entry key is
replaced with the interned lookup key. The next fast search for the same
key will succeed. There is a very good chance that it will be handled by
the inline macro.

* Negative entries
In name lookup most accesses fail. In order to speed them up negative
entries can mark a name as "positively not there", usually detected by the
macro without requiring any function calls. Negative entries have the
interned key as their me_key and me_value is NULL. Negative entries occupy
real space in the hash table and cannot be reused as empty slots. This
optimization technique is not practical for general purpose dictionaries
because some types of code would quickly overload the dictionary with
many negative entries. For name lookup the number of negative entries is
bound by the number of global and builtin names referenced by the code
that uses the dictionary as a namespace.

This new type of slot has a surprisingly small impact on the rest of
dictobject.c. Only one assertion had to be removed to accomodate it. All
other code treats it as either an active entry (key !=NULL, !=dummy) or a 
deleted entry (value == NULL, key != NULL) and just happens to do the Right 
Thing for each case.

If an entry with the same key as a negative entry is subsequently inserted 
into the dictionary it will overwrite the negative entry and be reflected 
immediately in the namespace. There is no caching and therefore no cache 
coherency issues.

Known bugs:

Negative entries do not resize the table. If there is not enough free space 
in the table they are simply not inserted.

Assumes CACHE_HASH, INTERN_STRINGS without checking.

Future directions:

It should be possible to apply this to more than just LOAD_ATTR and 
LOAD_GLOBAL: attributes, modules, setting items, etc.

The hit rate for the inline macro varies from 100% in most simple cases 
to 0% in cases where the first hash position in the table happens to be 
occupied by another entry. Even in these cases it is still very fast, but 
I want to get more consistent performance. I am starting to experiment 
with probabilistic techniques that shuffle entries in the hash table and 
try to ensure that the entries accessed most often are kept in the first 
hash positions as much as possible.

Test code:

* builtin.py
class f:
  for i in xrange(10000000):
    hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ;
    hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ;

* global.py
hex = 5
class f:
  for i in xrange(10000000):
    hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ;
    hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ;

* local.py
class f:
  hex = 5
  for i in xrange(10000000):
    hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ;
    hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ;

* fastlocal.py
def f():
  hex = 5
  for i in xrange(10000000):
    hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ;
    hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ;

f()

	Oren