[SciPy-User] numexpr.evaluate slower than eval, why?

Mon Nov 1 17:26:55 EDT 2010

On Monday 01 November 2010 20:24:51 Gerrit Holl wrote:
> Hi,
> 
> (since I couldn't find any numexpr mailing-list, I ask the question
> here)
> 
> I am working with pytables and numexpr. I use pytables' .where()
> method to select fields from my data. Sometimes I can't do that and I
> need to select them "by hand", but to keep the interface constant and
> avoid the need to parse things myself, I evaluate the same strings to
> sub-select fields from my data. To my surprise, numexpr.evaluate is
> about two times slower than eval. Why?
> 
> In [130]: %timeit numexpr.evaluate('MEAN>1000', recs)
> 10000 loops, best of 3: 117 us per loop
> 
> In [131]: %timeit eval('MEAN>1000', {}, {'MEAN': recs['MEAN']})
> 10000 loops, best of 3: 55.4 us per loop
> 
> In [132]: %timeit recs['MEAN']>1000
> 10000 loops, best of 3: 42.1 us per loop

There are several causes for this.  First, numexpr is not always faster 
than numpy.  It mainly wins when temporaries enter into the equation, 
that is, when you are evaluating complex expressions.  The expression 
above is a simple one with no temporaries at all, so you cannot expect 
a large speed-up from numexpr.
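
To make the point about temporaries concrete, here is a minimal sketch 
(the arrays `a` and `b` and the expression are made up for illustration; 
they are not from the original question).  NumPy evaluates the compound 
expression one operator at a time and allocates a full-size intermediate 
array for each sub-result, while numexpr receives the whole expression 
as a single string:

```python
import numpy as np
import numexpr as ne

# Hypothetical inputs, chosen only for illustration.
a = np.arange(1e5)
b = np.arange(1e5)

# NumPy: 2*a and 3*b each become a full-size temporary array
# before the final addition produces the result.
via_numpy = 2 * a + 3 * b

# numexpr: the whole expression is handed over as one string and
# evaluated by numexpr's own virtual machine.
via_numexpr = ne.evaluate('2*a + 3*b', local_dict={'a': a, 'b': b})

# Both routes should give the same values.
same = np.allclose(via_numpy, via_numexpr)
```

Timing the two variants on your own machine (with `timeit`) is the only 
reliable way to know where the break-even point lies.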

Secondly, the 2x slowdown you are getting in the above expression is 
probably due to the fact that you are using small inputs (i.e. len(recs) 
is small) and that numexpr is using several threads automatically.  For 
such small arrays, the current threading code introduces a significant 
overhead.

Consider this (using a 2-core machine here):

>>> import numpy as np
>>> import numexpr as ne
>>> ne.set_num_threads(2)
>>> a = np.arange(1e3)
>>> timeit ne.evaluate('a>1000')
10000 loops, best of 3: 31.5 µs per loop
>>> timeit eval('a>1000')
100000 loops, best of 3: 19.5 µs per loop
>>> timeit a>1000
100000 loops, best of 3: 4.35 µs per loop

i.e. for small arrays, eval+numpy is faster.  To prove that this is 
mainly due to the overhead of internal threading code, let's force the 
use of a single thread with numexpr:

>>> ne.set_num_threads(1)
>>> timeit ne.evaluate('a>1000')
100000 loops, best of 3: 18.8 µs per loop

which is very close to eval + numpy performance.  Finally, using a 
one-element array, we can see how almost all of the evaluation time is 
spent in the compilation phase:

>>> a = np.arange(1e0)
>>> timeit ne.evaluate('a>1000')
100000 loops, best of 3: 16.4 µs per loop
>>> timeit eval('a>1000')
100000 loops, best of 3: 17.5 µs per loop

[Incidentally, one can see that numexpr's compiler is slightly faster 
than Python's.  Wow, what a welcome surprise!]

Interestingly enough, things change dramatically for larger arrays:

>>> ne.set_num_threads(2)
>>> b = np.arange(1e5)
>>> timeit ne.evaluate('b>1000')
10000 loops, best of 3: 97.5 µs per loop
>>> timeit eval('b>1000')
10000 loops, best of 3: 138 µs per loop
>>> timeit b>1000
10000 loops, best of 3: 123 µs per loop

In this case, numexpr is about 25% faster than numpy.  This speed-up is 
mostly due to the automatic use of several threads (2 cores and 
2 threads above).  Forcing the use of a single thread, we have:

>>> ne.set_num_threads(1)
>>> timeit ne.evaluate('b>1000')
10000 loops, best of 3: 112 µs per loop

which is closer to numpy performance (but still about 10% faster; I 
don't know exactly why).

So, the lesson to learn here is that, if you work with small arrays and 
want to attain at least the same performance as Python's `eval`, then 
you should set the number of threads in numexpr to 1.
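
If your code deals with both small and large arrays, one way to apply 
this is a small wrapper that drops to a single thread only for small 
inputs.  This is just a sketch: the helper name `evaluate_size_aware` 
and the cutoff `SMALL_ARRAY_THRESHOLD` are hypothetical (neither is part 
of numexpr), and the cutoff value is a guess that you should replace 
with one measured on your own machine.  It relies on 
`ne.set_num_threads()` returning the previous setting.

```python
import numpy as np
import numexpr as ne

# Hypothetical cutoff (number of elements); tune it with timeit.
SMALL_ARRAY_THRESHOLD = 10000


def evaluate_size_aware(expr, local_dict, threshold=SMALL_ARRAY_THRESHOLD):
    """Evaluate expr with numexpr, using one thread for small inputs.

    Hypothetical helper, not part of numexpr.  The thread count is a
    global numexpr setting, so this is not safe to call concurrently
    from several Python threads.
    """
    # Size of the largest operand (0 if there are no operands).
    largest = max((np.size(v) for v in local_dict.values()), default=0)
    if largest >= threshold:
        # Large input: keep whatever thread count is currently set.
        return ne.evaluate(expr, local_dict=local_dict)
    # Small input: force a single thread, then restore the old value.
    previous = ne.set_num_threads(1)
    try:
        return ne.evaluate(expr, local_dict=local_dict)
    finally:
        ne.set_num_threads(previous)


a = np.arange(1e3)
mask = evaluate_size_aware('a > 500', {'a': a})
```

Restoring the previous thread count in the `finally` clause keeps later 
calls on large arrays multi-threaded.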

Hmm, now that I think about this, it would be interesting if numexpr 
could automatically disable the multi-threading code for small arrays.  
I have added a ticket:

http://code.google.com/p/numexpr/issues/detail?id=36

> (on a side-note: what is python/evals definition of a mapping?
> numexpr evaluates recs (a numpy.recarray) as a mapping, but eval
> does not)

Numexpr comes with special machinery to recognize many of NumPy's 
features, like automatic detection of strided or unaligned arrays.  In 
particular, structured arrays / recarrays are also recognized, and 
computations are optimized based on all this metainfo.  Python's 
compiler, on the other hand, knows nothing about NumPy objects and 
hence has no way to apply such optimizations.
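
If you still want to use plain `eval` on a structured array, a simple 
workaround is to build the namespace yourself from the field names, as 
you already do by hand for 'MEAN' above.  A minimal sketch, where the 
array contents and the 'ID' field are made up for illustration:

```python
import numpy as np

# Hypothetical structured array standing in for recs.
recs = np.array(
    [(1, 500.0), (2, 1500.0), (3, 2500.0)],
    dtype=[('ID', 'i4'), ('MEAN', 'f8')],
)

# Map every field name to its column so eval can resolve the names.
fields = {name: recs[name] for name in recs.dtype.names}

# Same expression string as in the numexpr call; empty globals dict.
mask = eval('MEAN > 1000', {}, fields)
selected = recs[mask]
```

As with any use of `eval`, only feed it expression strings that you 
trust.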

-- 
Francesc Alted