[Python-Dev] Python 2.3 release schedule

Guido van Rossum guido@python.org
Sat, 25 May 2002 12:41:23 -0400


> > Piling more complexity on is likely to slow down the common case though.
> 
> Not at all. The common case (the macro) will not have a single
> instruction added.  The rare case (fallback function) will be slowed
> down by two instructions: decrement a global variable, jump if zero.
> The reshuffle will be triggered every 100th occurrence of the rare
> case (which may be every 10000th occurrence of the common case).
> 
> It's the same approach used when designing a memory cache for a CPU:
> it's OK if a cache miss is made slower by loading an entire cache
> row as long as it improves the cache hit rate enough to justify it.

OK, OK.
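
For the record, the shape of that trigger in Python pseudocode (the
real thing is C in the interpreter loop; lookup_fallback and
reshuffle are invented names here):

  RESHUFFLE_EVERY = 100        # rare-case misses between reshuffles
  countdown = RESHUFFLE_EVERY

  def reshuffle(d):
      # Stand-in for the real reorganization (done in C); no-op here.
      pass

  def lookup_fallback(d, key):
      # Rare path: the inline macro missed, so pay for a full lookup
      # plus the two extra instructions mentioned above.
      global countdown
      countdown = countdown - 1      # decrement a global variable
      if countdown == 0:             # jump if zero
          countdown = RESHUFFLE_EVERY
          reshuffle(d)
      return d[key]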

I timed pystone, and it's about 1.5% faster.  If that's the best we
can do (and I think that any of the PEPs trying to optimize global
and builtin access probably yield about the same improvement as your
code), I'm not sure we should worry about this.  I'd rather try to
make method calls and instance attribute access (both methods and
instance variables) faster...

> > > For each dictionary, the number of negative entries is bound by the
> > > number of global and builtin names in the co_names of all code
> > > objects that get executed with that dictionary as their namespace.
> > 
> > And that's my worry.  Suppose a program has only one global but
> > touches every builtin.  Will the dictionary properly get resized to
> > accommodate all those negative entries?
> 
> >>> len(dir(__builtins__))*24  # sizeof(PyDictEntry) on i386
> 2736
> 
> Caching of string hashes also costs memory. Speed optimizations
> often do.  These costs need to be weighed against complexity,
> performance, and even the code size - other alternatives may use up
> as much code space as this one uses data in this worst-case
> scenario.

I wasn't worried about size; I was worried about correctness.  But it
appears that your code is fine.
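
To make that bound concrete: every global and builtin name a code
object can reference shows up in its co_names tuple, so it's easy to
inspect (Python 2.x attribute spelling):

  def f():
      return hex(spam)

  print f.func_code.co_names   # ('hex', 'spam'); plain locals would
                               # land in co_varnames instead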

BTW, here's a benchmarking tip.  Instead of timing this:

  for i in xrange(10000000):
      hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ;
      hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ;

you should time the for loop in this code:

  x = [0]*10000000
  for i in x:
      hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ;
      hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ;

The xrange iterator allocates (and implicitly deallocates) an int per
iteration, and you don't use that int at all.  (10 million pointers
is 40 MB -- if that's too much, it's OK to test with 1 million
iterations.)  Quick outcome:

CVS python 2.3:
builtin: 8.480
global: 6.780
local: 5.080
fast: 2.890

With your patch:
builtin: 4.660
global: 4.160
local: 4.020
fast: 3.000
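
In case anyone wants to reproduce these numbers, here's a
self-contained version of the builtin row (the other rows differ only
in where the name lives; time.clock and 1 million iterations are just
one reasonable choice):

  import time

  def bench(x):
      # Inside a function, "hex" compiles to LOAD_GLOBAL, which falls
      # through to the builtins dict -- the case being optimized.
      t0 = time.clock()
      for i in x:
          hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex
          hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex ; hex
      return time.clock() - t0

  x = [0] * 1000000   # 1M pointers is about 4 MB; use 10M if 40 MB is OK
  print "builtin: %.3f" % bench(x)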

Hm, this reports:
0 inline dictionary lookups
1568 fast dictionary lookups
121 slow dictionary lookups
0.00% inline
7.16% slow
created 308 string dicts
converted 32 to normal dicts
10.39% conversion rate

Somehow the counts seem to be off by several orders of magnitude.
What *do* these numbers report?

Also note that your patch slows down fast (local) access, by almost
4% (2.890 -> 3.000)!  How can that be?  Adding more code to the
interpreter's inner loop changes the cache behavior, etc.  Tim Peters
can tell you more about this.

--Guido van Rossum (home page: http://www.python.org/~guido/)