User class binary ops seem too slow (was re: GIL detector)

Terry Reedy tjreedy at udel.edu
Mon Aug 18 00:43:58 EDT 2014


In a post about CPython's GIL, Steven D'Aprano pointed to  Armin 
Ronacher's criticism of the internal type slots used for dunder methods.

 > http://lucumr.pocoo.org/2014/8/16/the-python-i-would-like-to-see/

I found the following interesting.
"
Since we have an __add__ method the interpreter will set this up in a 
slot. So how fast is it? When we do a + b we will use the slots, so here 
is what it times it as:

$ python3 -mtimeit -s 'from x import A; a = A(); b = A()' 'a + b'
1000000 loops, best of 3: 0.256 usec per loop

If we do however a.__add__(b) we bypass the slot system. Instead the 
interpreter is looking in the instance dictionary (where it will not 
find anything) and then looks in the type's dictionary where it will 
find the method. Here is where that clocks in at:

$ python3 -mtimeit -s 'from x import A; a = A(); b = A()' 'a.__add__(b)'
10000000 loops, best of 3: 0.158 usec per loop

Can you believe it: the version without slots is actually faster. What 
magic is that? I'm not entirely sure what the reason for this is,"

Curious myself, I repeated the result on my Win7 machine and got almost 
the same numbers.

 >>> class A:
	def __add__(self, other): return 2
 >>> timeit.repeat('a + b', 'from __main__ import A; a=A(); b=A()')
[0.26080520927348516, 0.24120280310165754, 0.2412111032140274]
 >>> timeit.repeat('a.__add__(b)', 'from __main__ import A; a=A(); b=A()')
[0.17656398710346366, 0.15274235713354756, 0.1528444177747872]
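
For anyone who wants to rerun the comparison outside an interactive
session, here is a standalone sketch (my names, not from either post).
With timeit's default number=1000000, the reported seconds per repeat
read directly as microseconds per operation.

import timeit

setup = """
class A:
    def __add__(self, other):
        return 2
a = A(); b = A()
"""

for stmt in ('a + b', 'a.__add__(b)'):
    best = min(timeit.repeat(stmt, setup))  # take the minimum, as the timeit docs suggest
    print('%-12s %.3f usec' % (stmt, best))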

First I looked at the byte code.

 >>> dis('a+b')
   1           0 LOAD_NAME                0 (a)
               3 LOAD_NAME                1 (b)
               6 BINARY_ADD
               7 RETURN_VALUE
 >>> dis('a.__add__(b)')
   1           0 LOAD_NAME                0 (a)
               3 LOAD_ATTR                1 (__add__)
               6 LOAD_NAME                2 (b)
               9 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
              12 RETURN_VALUE
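
As a side check (a toy example of my own, not from the article), the two
paths really are different lookups: BINARY_ADD goes straight to the slot
on the type, while LOAD_ATTR does a normal attribute lookup that sees the
instance dict first.

class A:
    def __add__(self, other):
        return 2

a = A(); b = A()
a.__add__ = lambda other: 99   # shadow the method in the instance dict
print(a + b)           # 2  -- the operator still uses the slot from type(a)
print(a.__add__(b))    # 99 -- the explicit lookup finds the instance attribute first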

Next, the core of the BINARY_ADD code in Python/ceval.c:
             if (PyUnicode_CheckExact(left) &&
                      PyUnicode_CheckExact(right)) {
                 sum = unicode_concatenate(left, right, f, next_instr);
                 /* unicode_concatenate consumed the ref to v */
             }
             else {
                 sum = PyNumber_Add(left, right);

By the language definition, PyNumber_Add must check whether 
issubclass(type(b), type(a)) (and whether the subclass overrides 
__radd__). If so, it tries b.__radd__(a) first; otherwise it tries 
a.__add__(b). So BINARY_ADD has extra overhead before it calls __add__. 
Is that enough to explain the .09 microsecond difference between .25 
and .16 microseconds?
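
A quick illustration of that subclass rule (again a sketch of my own,
not taken from ceval.c):

class Base:
    def __add__(self, other):
        return 'Base.__add__'
    def __radd__(self, other):
        return 'Base.__radd__'

class Derived(Base):
    def __radd__(self, other):
        return 'Derived.__radd__'

print(Base() + Derived())   # Derived.__radd__ -- the subclass's reflected method is tried first
print(Base() + Base())      # Base.__add__     -- otherwise the left operand's __add__ wins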

Let's try some builtins:

 >>> timeit.repeat('1+1')
[0.04067762117549266, 0.019206152658126363, 0.018796680446902643]
 >>> timeit.repeat('1.0+1.0')
[0.032686457413774406, 0.023207729064779414, 0.018793606331200863]
 >>> timeit.repeat('(1.0+1j) + (1.0-1j)')
[0.037775348543391374, 0.01876409482042618, 0.018812358436889554]

 >>> timeit.repeat("''+''")
[0.04073695160855095, 0.018977745861775475, 0.018800676797354754]
 >>> timeit.repeat("'a'+'b'")
[0.04066932106320564, 0.01896145304840502, 0.01879268409652468]

 >>> timeit.repeat('1 .__add__(1)')
[0.16622020259652004, 0.15244908649577837, 0.15047857833215517]
 >>> timeit.repeat("''.__add__('')")
[0.17265801569533323, 0.1535966538865523, 0.15308880997304186]

For the common case of adding builtin numbers and empty strings, the 
binary operation is about 8 times as fast as the dict lookup and 
function call. For empty lists, the ratio is between 2 and 3:

 >>> timeit.repeat('[]+[]')
[0.09728684696551682, 0.08233527043626054, 0.08230698857164498]
 >>> timeit.repeat('[].__add__([])')
[0.22780949582033827, 0.2060266193825555, 0.2060967092206738]
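
Those ratios can be read straight off the repeats by taking the best
(minimum) of each triple; a back-of-the-envelope check, with the numbers
copied from the runs above:

op_int    = min([0.0407, 0.0192, 0.0188])   # '1+1'
call_int  = min([0.1662, 0.1524, 0.1505])   # '1 .__add__(1)'
op_list   = min([0.0973, 0.0823, 0.0823])   # '[]+[]'
call_list = min([0.2278, 0.2060, 0.2061])   # '[].__add__([])'

print(round(call_int / op_int, 1))      # -> 8.0
print(round(call_list / op_list, 1))    # -> 2.5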

Conclusions:
1. Python-level calls to the C wrappers of C functions are about as slow 
as calls to Python functions, which I already knew to be relatively slow 
(see the snippet after this list for what these wrappers are).

2. Taking into account the interpreter's internal binary operations on 
builtin Python objects, I suspect almost everyone benefits from the slot 
optimization.

3. The total BINARY_ADD + function call time for strings and numbers, 
about .02 microseconds, is much less than the .09 microsecond difference 
and cannot account for it.

4. There might be some avoidable overhead within PyNumber_Add that only 
affects custom-class instances (but I am done for tonight ;-).
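
(The 'C wrappers' in point 1 are the method-wrapper objects that, in
CPython, expose the C slot functions to Python-level code; for example:

print(type((1).__add__))   # <class 'method-wrapper'>
print(type(''.__add__))    # <class 'method-wrapper'>

Calling one of these from Python goes through the normal call machinery.)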

-- 
Terry Jan Reedy



