[Python-Dev] Caching of builtins and globals

Samuele Pedroni pedronis@bluewin.ch
Thu, 12 Jun 2003 15:23:12 +0200


I have played a bit with the notion of caching builtin and global values 
(to be precise, the address of their dictionary entry) without changing 
language semantics:

I have an experimental modification to the interpreter such that I get 
these numbers [bl_ is the baseline CVS code, ex_ is the modified interpreter]:

$ ./bl_python test.py
nop func
5.26905059814  ms
5.17392158508  ms
5.10895252228  ms
all-locals func
3537.45496273  ms
3536.81099415  ms
3537.69803047  ms
global/builtin-heavy func
4377.18105316  ms
4378.40998173  ms
4375.61702728  ms
sup ctr
57391.8089867  ms
57239.0290499  ms
57295.1930761  ms
$ ./ex_python test.py
nop func
5.01906871796  ms
5.34105300903  ms
5.30099868774  ms
all-locals func
3505.41305542  ms
3529.50799465  ms
3529.50108051  ms
global/builtin-heavy func
3750.57899952  ms
3752.10404396  ms
3725.95608234  ms
sup ctr
47528.2179117  ms
47529.8720598  ms
47295.0160503  ms
$

for

<test.py>
R = 3
N = 5000
M = 1000
import time
ch = 'T'

def bot():
    pass

class d(dict):
    def __init__(self):
        dict.__init__(self, {1: 1})

def f():
    ord_ = ord
    ch_ = ch
    x = 0
    for i in range(M):
        x += ord_(ch_)

def g():
    x = 0
    for i in range(M):
        x += ord(ch)

def h():
    dic = None
    for i in range(M):
        dic = d()

def test(func, n):
    t = time.time()
    for i in xrange(n):
        func()
    print (time.time() - t) * 1000, ' ms'

print "nop func"
for i in range(R):
    test(bot, N)
print "all-locals func"
for i in range(R):
    test(f, N)
print "global/builtin-heavy func"
for i in range(R):
    test(g, N)
print "sup ctr"
for i in range(R):
    test(h, 100)
</test.py>

[AMD Athlon 750MHz under RH Linux 7.2; pystone seems to get a 4-5% 
improvement, nevertheless I expect slowdowns in some cases]

idea/issues:
- the most "evil" thing I do is add a further pointer to all dicts to store 
what I call a version token. Reshaping* mutation paths for dicts grow code 
for updating the version token; this is executed conditionally, only if the 
version-token pointer is non-null, i.e. the version token was asked for at 
least once and so one was put in place. So mutations of normal dicts (not 
used as globals/builtins) pay only for skipping that code with a 
check-non-null-and-jump. [Version tokens are small objects with their own 
ref-counting logic; they are not version counters, for reasons related to 
overflow and so that they also capture the identity of their dicts:
	dic1 != dic2 => dic1.version != dic2.version ].
Using different dict kinds could avoid the everyone-pays problem. This is 
slowed down:

import mod
mod.foo = 1

if 'foo' doesn't already exist in mod because a new version token must be 
generated.

[*: empty entry to non-empty, the converse, and things that change 
key->entry (address) mapping (e.g. resizing)]

- I didn't touch the compiler/bytecode. I allocate caches with 
len(co_names) entries; LOAD_GLOBAL args range over [0, len(co_names)).

- the basic invariant is that as long as the version tokens of a given pair of 
globals/builtins dicts are unchanged, a LOAD_GLOBAL will retrieve its value 
*at the same* dict entry address. So the LOAD_GLOBAL fast path becomes:

		case LOAD_GLOBAL:
			if (((PyDictObject *)f->f_globals)->version == glover &&
			    ((PyDictObject *)f->f_builtins)->version == bltver) {
				if(glocache[oparg]) {
					TRC_GLOCACHE("using cache\n");
					x = glocache[oparg]->me_value;
					Py_INCREF(x);
					PUSH(x);
					continue;
				}
			} else
				/* version tokens changed: fall through to the
				   normal lookup and a fresh cache (elided) */
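
In Python terms, the fast path might look like the sketch below (all names 
are invented; note that the real patch caches the dict *entry address*, so a 
plain value overwrite is still seen through the cache, whereas this sketch 
caches the value itself purely for illustration):

```python
class GlobalCache(object):
    """Cache tied to one (globals, builtins) version-token pair."""
    def __init__(self, n_names, glo_version, blt_version):
        self.versions = (glo_version, blt_version)  # tokens the cache is valid for
        self.entries = [None] * n_names             # len(co_names) slots, by oparg

def load_global(oparg, names, cache, globals_d, builtins_d,
                glo_version, blt_version):
    # Fast path: both version tokens unchanged and the slot is populated.
    valid = cache.versions == (glo_version, blt_version)
    if valid and cache.entries[oparg] is not None:
        return cache.entries[oparg]
    # Slow path: ordinary globals-then-builtins lookup; populate the slot
    # only while the tokens still match (the real patch would instead set
    # up a fresh cache for the new token pair).
    name = names[oparg]
    value = globals_d[name] if name in globals_d else builtins_d[name]
    if valid:
        cache.entries[oparg] = value
    return value
```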

- caches are wrapped in CObjects (a quick hack) to get proper collection. 
They are stored and shared through code objects, together with the 
globals/builtins version tokens they correspond to. The invariant, plus the 
fact that a cache is *only* used for a given pair of version tokens, should 
guarantee that sharing is safe. If the version tokens change during the run 
of some code, a fresh cache is created and used. The assumption is that this 
is rare after the "startup" phase.
Code that pays a price is, e.g., a module that at top level sets up some 
globals, calls a function, adds some more globals, calls the function again... 
because each such function call will create and start populating a fresh cache.
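
A minimal sketch of that sharing/replacement policy (invented names; the 
real patch keeps the cache and its token pair on the code object, wrapped 
in a CObject):

```python
class CodeCache(object):
    """Per-code-object holder: one live cache per version-token pair."""
    def __init__(self, n_names):
        self.n_names = n_names
        self.versions = None   # token pair the current cache is valid for
        self.entries = None    # shared by all calls seeing the same tokens

    def cache_for(self, glo_version, blt_version):
        pair = (glo_version, blt_version)
        if self.versions != pair:
            # Tokens changed since the last call: discard the old cache and
            # start a fresh, empty one -- this is the cost paid by modules
            # that alternate adding globals with calling the function.
            self.versions = pair
            self.entries = [None] * self.n_names
        return self.entries
```

Callers that keep seeing the same token pair share one cache; any token 
change throws away whatever had been accumulated so far.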

- although I have tried to trigger DECREFs (which could run arbitrary code) 
only from consistent state, that's a delicate aspect. Keeping version tokens 
updated in a timely way is a similar problem.

- the ceval switch is indeed an unwieldy beast. I certainly didn't manage (nor 
really try) to find the best code layout with respect to CPU cache / code 
generation. The problem is that slowdowns for the not-fast paths (and, sadly, 
even unrelated code) depend on this.

It's two days' work (& my C is a bit rusty), so some bugs/issues could be 
lurking; OTOH the test suite passes. I have no more time to play with this 
further, and it's not really in general-consumption shape, but if someone is 
interested I can post the diff (it's less than 1000 lines including context).

regards.