[Python-Dev] [numpy wishlist] Interpreter support for temporary elision in third-party classes

Nathaniel Smith njs at pobox.com
Thu Jun 5 22:51:41 CEST 2014


Hi all,

There's a very valuable optimization -- temporary elision -- which
numpy can *almost* do. It gives something like a 10-30% speedup for
lots of common real-world expressions. It would probably be useful
for non-numpy code too. (In fact it generalizes the str += str
special case that's currently hardcoded in ceval.c.) But it can't be
done safely without help from the interpreter, and possibly not even
then. So I thought I'd raise it here and see if we can get any
consensus on whether and how CPython could support this.

=== The dream ===

Here's the idea. Take an innocuous expression like:

   result = (a + b + c) / c

This gets evaluated as:

   tmp1 = a + b
   tmp2 = tmp1 + c
   result = tmp2 / c
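
You can watch the interpreter do exactly this with the dis module
(opcode names are from the CPython versions current today and may
change across releases):

    import dis

    def f(a, b, c):
        return (a + b + c) / c

    dis.dis(f)
    # Two BINARY_ADD opcodes and then a divide opcode, each of which
    # pops its operands off the value stack and pushes a freshly
    # allocated result.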

All these temporaries are very expensive. Suppose that a, b, c are
arrays of N bytes each, where N is large. For simple arithmetic like
this, the costs are dominated by memory access. Allocating an N-byte
array requires the kernel to clear the memory, which incurs N bytes of
memory traffic. If all the operands are already allocated, then
performing a three-operand operation like tmp1 = a + b involves 3N
bytes of memory traffic (reading the two inputs plus writing the
output). In total, our example performs 3 allocations and touches 9
operands, for 12N bytes of memory traffic.
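
Spelling that out per statement:

    tmp1 = a + b        ->  1 allocation (N) + 3 operands (3N)
    tmp2 = tmp1 + c     ->  1 allocation (N) + 3 operands (3N)
    result = tmp2 / c   ->  1 allocation (N) + 3 operands (3N)
                            total: 3 allocations, 12N bytes of traffic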

If our arrays are small, then the kernel doesn't get involved and some
of these accesses will hit the cache, but OTOH the overhead of things
like malloc won't be amortized out; the best case starting from a cold
cache is 3 mallocs and 6N bytes worth of cache misses (or maybe 5N if
we get lucky and malloc'ing 'result' returns the same memory that tmp1
used, and it's still in cache).

There's an obvious missed optimization in this code, though, which is
that it keeps allocating new temporaries and throwing away old ones.
It would be better to just allocate a temporary once and re-use it:

   tmp1 = a + b
   tmp1 += c
   tmp1 /= c
   result = tmp1

Now we have only 1 allocation and 7 operands, so we touch only 8N
bytes of memory. For large arrays -- that don't fit into cache, and
for which per-op overhead is amortized out -- this gives a theoretical
33% speedup, and we can realistically get pretty close to this. For
smaller arrays, the re-use of tmp1 means that in the best case we have
only 1 malloc and 4N bytes worth of cache misses, and we also have a
smaller cache footprint, which means this best case will be achieved
more often in practice. For small arrays it's harder to estimate the
total speedup here, but 66% fewer mallocs and 33% fewer cache misses
is certainly enough to make a practical difference.
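
Here's a quick way to see the effect for yourself (illustrative only
-- the exact numbers depend on array size, your caches, and your
allocator):

    import numpy as np
    import timeit

    a, b, c = (np.random.rand(1000000) for _ in range(3))

    def natural():
        return (a + b + c) / c    # two temporaries plus the result

    def elided():
        tmp = a + b               # the only allocation
        tmp += c
        tmp /= c
        return tmp

    print(timeit.timeit(natural, number=100))
    print(timeit.timeit(elided, number=100))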

Such optimizations are important enough that numpy operations always
give the option of explicitly specifying the output array (like
in-place operators but more general and with clumsier syntax). Here's
an example small-array benchmark that IIUC uses Jacobi iteration to
solve Laplace's equation. It's been written in both natural and
hand-optimized formats (compare "num_update" to "num_inplace"):

   https://yarikoptic.github.io/numpy-vbench/vb_vb_app.html#laplace-inplace

num_inplace is totally unreadable, but because we've manually elided
temporaries, it's 10-15% faster than num_update. With our prototype
automatic temporary elision turned on, this difference disappears --
the natural code gets 10-15% faster, *and* we remove the temptation to
write horrible things like num_inplace.
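
For reference, the explicit-output style looks roughly like this (a
made-up helper, just to show the shape of the out= API):

    import numpy as np

    def sum_then_divide(a, b, c, out):
        # Every operation writes into an explicitly specified output
        # array, so nothing is allocated per call.
        np.add(a, b, out=out)         # out = a + b
        np.add(out, c, out=out)       # out += c
        np.divide(out, c, out=out)    # out /= c
        return out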

What do I mean by "automatic temporary elision"? It's *almost*
possible for numpy to automatically convert the first example into the
second. The idea is: we want to replace

  tmp2 = tmp1 + c

with

  tmp1 += c
  tmp2 = tmp1

And we can do this by defining

   def __add__(self, other):
       if is_about_to_be_thrown_away(self):
           return self.__iadd__(other)
       else:
           ...

Now tmp1.__add__(c) does an in-place add and returns tmp1; no
allocation occurs, woohoo.

The only little problem is implementing is_about_to_be_thrown_away().

=== The sneaky-but-flawed approach ===

The following implementation may make you cringe, but it comes
tantalizingly close to working:

bool is_about_to_be_thrown_away(PyObject *obj)
{
    /* If the eval loop's stack reference is the only one left, the
       object is garbage the moment this operation returns. */
    return Py_REFCNT(obj) == 1;
}
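
If you want to poke at refcounts from Python, sys.getrefcount is the
tool -- just remember that it counts its own argument as one extra
reference:

    import sys

    a = object()
    print(sys.getrefcount(a))  # 2: the name 'a', plus getrefcount's argument
    b = a
    print(sys.getrefcount(a))  # 3: 'a', 'b', plus getrefcount's argument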

In fact, AFAICT it's 100% correct for libraries being called by
regular python code (which is why I'm able to quote benchmarks at you
:-)). The bytecode eval loop always holds a reference to all operands,
and then immediately DECREFs them after the operation completes. If
one of our arguments has no other references besides this one, then we
can be sure that it is a dead obj walking, and steal its corpse.
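
You can watch this from Python with a toy class (the exact counts are
a CPython implementation detail and vary by version; the
named-vs-temporary difference is the point):

    import sys

    class Probe:
        def __add__(self, other):
            # A true temporary's only references are the eval loop's
            # stack slot plus the ones created by this very call.
            print(sys.getrefcount(self))
            return Probe()

    p, q = Probe(), Probe()
    p + q        # here 'self' is also bound to the name 'p'
    (p + q) + q  # the outer +'s 'self' is a nameless temporary: one lower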

But this has a fatal flaw: people are unreasonable creatures, and
sometimes they call Python libraries without going through ceval.c
:-(. It's legal for random C code to hold the only reference to an
array object, call PyNumber_Add on it, and then expect the original
array object to still be valid. But who writes
code like that in practice? Well, Cython does. So, this is no-go.

=== A better (?) approach ===

This is a pretty arcane bit of functionality that we need, and it
interacts with ceval.c, so I'm not at all confident about the best way
to do it. (We even have an implementation using libunwind to walk the
C stack and make sure that we're being called from ceval.c, which...
works, actually, but is unsatisfactory in other ways.) I do have an
idea that I *think* might work and be acceptable, but you tell me:

Proposal: We add an API call

    PyEval_LastOpDefinitelyMatches(frame, optype, *args)

which checks whether the last instruction executed in 'frame' was in
fact an 'optype' instruction and did in fact have arguments 'args'. If
it was, then it returns True. If it wasn't, or if we aren't sure, it
returns False. The intention is that 'optype' is a semantic encoding
of the instruction (like "+" or "function call") and thus can be
preserved even if the bytecode details change.

Then, in numpy's __add__, we do:

1) fetch the current stack frame from TLS
2) check PyEval_LastOpDefinitelyMatches(frame, "+", arg1, arg2)
3) check for arguments with refcnt == 1
4) check that all arguments are base-class numpy array objects (i.e.,
PyArray_CheckExact)

The logic here is that step (2) tells us that someone did 'arg1 +
arg2', so ceval.c is holding a temporary reference to the arguments,
and step (3) tells us that at the time of the opcode evaluation there
were no other references to these arguments, and step (4) tells us
that 'arg1 + arg2' dispatched directly to ndarray.__add__ so there's
no chance that anyone else has borrowed a reference in the meantime.
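
To make that concrete, here's a Python-flavored sketch of the four
steps (hypothetical through and through: the real checks would be C
code inside ndarray.__add__, and _last_op_definitely_matches stands
in for the proposed API, which doesn't exist, so the stub just
answers "not sure"):

    import sys
    import numpy as np

    def _last_op_definitely_matches(frame, optype, *args):
        # Stand-in for the proposed PyEval_LastOpDefinitelyMatches;
        # no such API exists today, so conservatively say "not sure".
        return False

    def _can_elide_add(x, y):
        frame = sys._getframe(1)  # (1) in C: fetch the frame from TLS
        return (_last_op_definitely_matches(frame, "+", x, y)  # (2)
                # (3) in C the test is simply Py_REFCNT(x) == 1; from
                # Python each intervening frame adds references, so
                # there is no exact equivalent of this line:
                and sys.getrefcount(x) == 2
                and type(x) is np.ndarray   # (4) exact ndarray,
                and type(y) is np.ndarray)  #     no subclass dispatch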

AFAICT PyEval_LastOpDefinitelyMatches can *almost* be implemented now;
the only problem is that stack_pointer is a local variable in
PyEval_EvalFrameEx, and we would need it to be accessible via the
frame object. The easy way would be to just move it in there. I don't
know if this would have any weird speed effects due to cache
behavior, but I guess we could arrange to put it into the same cache
line as f_lasti, which is also updated on every opcode? OTOH someone
has gone to some trouble to make sure that f_stacktop usually
*doesn't* point to the top of the stack, and I guess there must have
been some reason for this. Alternatively we could stash a pointer to
stack_pointer in the frame object, and that would only need to be
updated once per entry/exit to PyEval_EvalFrameEx.

Obviously there are a lot of details to work out here, like what the
calling convention for PyEval_LastOpDefinitelyMatches should really
be, but:

* Does this approach seem like it would successfully solve the problem?
* Does this approach seem like it would be acceptable in CPython?
* Is there a better idea I'm missing?

-n

-- 
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org
