About alternatives to Matlab

sturlamolden sturlamolden at yahoo.no
Mon Dec 4 21:08:48 EST 2006


Jon Harrop wrote:

> So the super-fast BLAS routines are now iterating over the arrays many times
> instead of once and the whole program is slower than a simple loop written
> in C.

Yes. And the biggest overhead is probably Python's function calls.

Little is as efficient as well-written ISO C99 (not to be confused with
C++ or ANSI C). The only thing I can think of is hand-written assembly.
But whether a single loop is faster than several BLAS calls depends not
only on how many times the memory is traversed; it also depends a lot
on cache prefetching. So I assume you make sure that the cache is
prefetched and exploited optimally for your CPU in your C code? And you
are presumably also inlining assembly to use SSE and other SIMD
opcodes? After all, SSE can speed up floating point performance by a
factor of four if the computations can be carried out in parallel. At
one point, one has to stop and ask a very important question:

How fast is fast enough? And how much suffering am I willing to endure
to achieve that speed?
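
To make the trade-off concrete, here is a small NumPy sketch (my own
illustration, with arbitrary array names) of the multiple-pass
behaviour in question: each vectorized operation traverses the whole
array once and may allocate a temporary, while in-place operations at
least avoid the extra allocations:

    import numpy as np

    a = np.random.rand(1000000)
    b = np.random.rand(1000000)
    c = np.random.rand(1000000)

    # Two passes over memory, plus a hidden temporary for a*b:
    d = a * b + c

    # Still two passes, but no temporary beyond the output array:
    d = np.empty_like(a)
    np.multiply(a, b, out=d)  # first pass, written directly into d
    d += c                    # second pass, in place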

Bad Fortran 95 can easily be 25% slower than well-tuned C99. But how
much time was saved writing and debugging the code? I value my own time
more than a few extra CPU cycles. With Fortran 95, it is easy to write
numerical code that runs reasonably fast. It is only a little bit
harder to write numerical code that runs very fast. There is nothing in
Fortran that prevents you from writing loops that are equally fast or
faster than ANSI C loops. If you have several consecutive slicing
operations, a good Fortran compiler can detect this and fuse them into
a single loop. Never underestimate the ability of a modern Fortran
compiler to generate efficient machine code, even if array slicing is
used heavily.

Also never underestimate the truth in Knuth's claim that premature
optimization is the root of all evil in computer programming. It is a
lot smarter to write a pure Python/NumPy solution, and if it's too slow,
run the profiler, identify the worst bottlenecks, and translate them to
Fortran. The result will probably not be as fast as a pure C solution.
But if it is still too slow, money is better spent buying better
hardware than attempting a major rewrite in C/C99/C++.
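
For what it's worth, here is a minimal sketch of that workflow using
the standard cProfile module (the simulate() function is a made-up
stand-in for your real program):

    import cProfile
    import pstats
    import numpy as np

    def simulate(n=1000):
        # Hypothetical NumPy workload; replace with your own code.
        a = np.random.rand(n, n)
        return np.linalg.svd(a, compute_uv=False)

    # Profile a run and print the ten most expensive calls; those
    # are the candidates for a Fortran (e.g. f2py) rewrite.
    cProfile.run("simulate()", "profile.out")
    pstats.Stats("profile.out").sort_stats("cumulative").print_stats(10)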

NumPy can be slower than equivalent C by an order of magnitude. But
still, that may be more than fast enough. The beauty of NumPy is really
the way it allows arrays of numerical data to be created and
manipulated within Python, not the speed of python code using NumPy
objects. We have already seen that NumPy is faster than the most recent
version of Matlab, the most expensive commercial tool on the market. In
other words, it is about as fast as you can expect a numerical scripting
tool to be, although there is certainly still room for improvement.
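
As a tiny illustration of that convenience (nothing exotic, just
standard NumPy idioms):

    import numpy as np

    x = np.arange(12.0).reshape(3, 4)  # create a 3x4 array of floats
    means = x.mean(axis=0)             # reduce along the columns
    x[:, ::2] *= 2.0                   # scale every other column in place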

In a numerical scripting language it is certainly not true that several
slicing operations are slower than a single big for-loop. That is
because of the overhead incurred every time the interpreter is invoked:
the performance is to a large extent determined by the number of
interpreter invocations. That is why it is preferred to "vectorize"
NumPy code using slicing, repmat, reshapes, etc., rather than looping
with a pythonic for-loop. Even if you have to iterate over the memory
several times, array slicing will still be a lot faster than a pythonic
for-loop.
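
A minimal comparison of the two styles (exact timings will vary, but
the vectorized form typically wins by a large factor, because it
enters the interpreter a handful of times instead of once per
element):

    import numpy as np

    x = np.random.rand(1000000)

    # Pythonic for-loop: one interpreter dispatch per element.
    y = np.empty_like(x)
    for i in range(x.shape[0]):
        y[i] = 2.0 * x[i] + 1.0

    # Vectorized: a few interpreter dispatches in total, even though
    # the memory is traversed more than once.
    y = 2.0 * x + 1.0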


> Indeed, this algorithm could be written concisely using pattern matching
> over linked lists. I wonder if that would be as fast as using BLAS from
> Python.

Pattern matching on linked lists? For wavelet transforms? Seriously?
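
For the record, the slicing style described above handles a wavelet
transform quite naturally. Here is a sketch of one level of a Haar
transform (my own illustration, not code from this thread):

    import numpy as np

    def haar_level(x):
        # One level of the Haar wavelet transform, vectorized with
        # slicing: pairwise sums and differences, no Python loop.
        even, odd = x[0::2], x[1::2]
        s = (even + odd) / np.sqrt(2.0)  # approximation coefficients
        d = (even - odd) / np.sqrt(2.0)  # detail coefficients
        return s, d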



