[Numpy-discussion] Numpy and PEP 343

Thu Mar 2 10:13:03 EST 2006

eric jones <eric at enthought.com> writes:

> Travis Oliphant wrote:
>
>> Weave can still help with the "auto-compilation" of the specific
>> library for your type.  Ultimately such code will be faster than
>> NumPy can every be.
>
> Yes.  weave.blitz() can be used to do the equivalent of this lazy
> evaluation for you in many cases without much effort.  For example:
>
> import weave
> from scipy import arange
>
> a = arange(1e7)
> b = arange(1e7)
> c=2.0*a+3.0*b
>
> # or with weave
> weave.blitz("c=2.0*a+3.0*b")
>
> As Paul D. mentioned.what Tim outlined is essentially template
> expressions in C++.  blitz++ (http://www.oonumerics.org/blitz/) is a
> C++ template expressions library for array operations, and weave.blitz
> translates a Numeric expression into C++ blitz code.  For the example
> above on large arrays, you get about a factor of 4 speed up on large
> arrays. (Notice, the first time you run the example it will be much
> slower because of compile time.  Use timings from subsequent runs.)
>
> C:\temp>weave_time.py
> Expression: c=2.0*a+3.0*b
> Numeric: 0.678311899322
> Weave: 0.162177084984
> Speed-up: 4.18253848494
>
> All this to say, I think weave basically accomplishes what Tim wants
> with a different mechanism (letting C++ compilers do the optimization
> instead of writing this optimization at the python level).  It does
> require a compiler on client machines in its current form (even that
> can be fixed...), but I think it might prove faster than
> re-implementing a numeric expression compiler at the python level
> (though that sounds fun as well).

I couldn't leave it at that :-), so I wrote my bytecode idea up.

You can grab it at http://arbutus.mcmaster.ca/dmc/software/numexpr-0.1.tar.gz

Here are the performance numbers (note I use 1e6 elements instead of
1e7)

cookedm at arbutus$ py -c 'import numexpr.timing; numexpr.timing.compare()'
Expression: b*c+d*e
numpy: 0.0934900999069
Weave: 0.0459051132202
Speed-up of weave over numpy: 2.03659447388
numexpr: 0.0467489004135
Speed-up of numexpr over numpy: 1.99983527056

Expression: 2*a+3*b
numpy: 0.0784019947052
Weave: 0.0292909860611
Speed-up of weave over numpy: 2.67665945222
numexpr: 0.0323888063431
Speed-up of numexpr over numpy: 2.42065094572

Wow. On par with weave.blitz, and no C++ compiler!!! :-)

You use it like this:

from numexpr import numexpr

func = numexpr("2*a+3*b")

a = arange(1e6)
b = arange(1e6)
c = func(a, b)

Alternatively, you can use it like weave.blitz, like this:

from numexpr import evaluate

a = arange(1e6)
b = arange(1e6)
c = evaluate("2*a+3*b")

How it works
============

Well, first of all, it only works with the basic operations (+, -, *,
and /), and only on real constants and arrays of doubles. (These
restrictions are mainly for lack of time, of course.)

Given an expression, it first compiles it to a program written in a
small bytecode. Here's what it looks like:

In [1]: from numexpr import numexpr
In [2]: numexpr("2*a+3*b", precompiled=True)
# precompiled=True just returns the program before it's made into a
# bytecode object, so you can what it looks like
Out[2]:
[('mul_c', Register(0), Register(1, a), Constant(0, 2)),
 ('mul_c', Temporary(3), Register(2, b), Constant(1, 3)),
 ('add', Register(0), Register(0), Temporary(3))]
In [3]: c = numexpr("2*a+3*b")
In [4]: c.program
Out[4]: '\x08\x00\x01\x00\x08\x03\x02\x01\x02\x00\x00\x03'
In [5]: c.constants
Out[5]: array([ 2.,  3.])
In [6]: c.n_temps
Out[6]: 1
In [7]: type(c)
Out[7]: <type 'numexpr.NumExpr'>

The bytecode program is a string of 4-character groups, where the
first character of each group is the opcode, the second is the
"register" to store in, and the third and fourth are the register
numbers to use as arguments. Inputs and outputs are assigned to
registers (so in the example above, the output is Register(0), and the
two inputs are Register(1) and Register(2)), and temporaries are in
remaining registers. The group of registers is actually an array of
double*.

The secret behind the speed is that the bytecode program is run
blocked: it's run repeatedly, operating on 128 elements of each input
array at at time (then 8 elements then 1 element for the leftover).
This does a tradeoff between cache misses (the arguments for each time
through the program will most likely be able to fit in the cache each
time), and the overhead of the branching from evaluating the program.
It's like doing this in numpy:

c[0:128] = 2*a[0:128] + 3*b[0:128]
c[128:256] = 2*a[128:256] + 3*b[128:256]
etc.

I could see this getting a nice speedup from using a CPU's vector
instructions. For instance, it should be pretty easy to use liboil for
the intermediate vector evaluations.

If people think it's useful enough, I can check it into scipy's sandbox.

-- 
|>|\/|<
/--------------------------------------------------------------------------\
|David M. Cooke                      http://arbutus.physics.mcmaster.ca/dmc/
|cookedm at physics.mcmaster.ca