About alternatives to Matlab

Jon Harrop jon at ffconsultancy.com
Fri Dec 15 11:54:47 EST 2006


Filip Wasilewski wrote:
> Jon Harrop wrote:
>> Yes. The time taken is dominated by memory accesses. The amount of
>> arithmetic is almost irrelevant.
> 
> I can say from my experience, that this depends much on the input size
> and the CPU cache size. For arrays of about 2^18 elements or less and 2MB cache
> the speedup for the straightforward C lifting implementation is around
> 50%, while all operations are done in-place. For bigger arrays
> traversing memory several times becomes too expensive, at least until
> the input size is so big that the memory gets swapped and the inplace
> method becomes faster again. I suppose that such behavior should be
> quite similar for other low-level implementations as well.

With my two OCaml implementations, the lifting scheme is always slower. Here
is n vs t for convolution (i.e. a simple loop):

{{4, 0.0000001},
{8, 0.0000003},
{16, 0.0000006},
{32, 0.0000012},
{64, 0.0000023},
{128, 0.0000046},
{256, 0.0000089},
{512, 0.0000178},
{1024, 0.0000355},
{2048, 0.0000720},
{4096, 0.0001445},
{8192, 0.0002968},
{16384, 0.0005956},
{32768, 0.0012225},
{65536, 0.0029214},
{131072, 0.0062959},
{262144, 0.0128418},
{524288, 0.0259961},
{1048576, 0.0519921},
{2097152, 0.1037343},
{4194304, 0.2064685},
{8388608, 0.4059380}}

and the same for the lifting scheme with deforesting optimisations:

{{4, 0.0000006},
{8, 0.0000013},
{16, 0.0000026},
{32, 0.0000048},
{64, 0.0000091},
{128, 0.0000174},
{256, 0.0000341},
{512, 0.0000681},
{1024, 0.0001373},
{2048, 0.0002736},
{4096, 0.0005483},
{8192, 0.0010936},
{16384, 0.0022086},
{32768, 0.0044017},
{65536, 0.0087643},
{131072, 0.0176380},
{262144, 0.0357446},
{524288, 0.0720515},
{1048576, 0.1449780},
{2097152, 0.2912058},
{4194304, 0.5824115},
{8388608, 1.1698220}}

The results will vary between languages, of course, but from these figures
convolution is roughly 3x faster over the whole range 4 <= n <= 2^23 (nearer
4-6x at the smallest sizes).
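
To be concrete about what "convolution (i.e. a simple loop)" means here: a
minimal sketch (not the code that produced the timings above) of one D4
decomposition level written as a direct loop over the input, assuming the
standard D4 filter taps h0..h3 and periodic boundary handling:

let h0, h1, h2, h3 =
  let s3 = sqrt 3. and s2 = sqrt 2. in
  (1. +. s3) /. (4. *. s2), (3. +. s3) /. (4. *. s2),
  (3. -. s3) /. (4. *. s2), (1. -. s3) /. (4. *. s2)

let d4_step cA =
  let n = Array.length cA / 2 in
  let approx = Array.make n 0. and detail = Array.make n 0. in
  for i = 0 to n - 1 do
    (* periodic wrap-around at the right-hand edge *)
    let x0 = cA.(2*i) and x1 = cA.(2*i + 1)
    and x2 = cA.((2*i + 2) mod (2*n)) and x3 = cA.((2*i + 3) mod (2*n)) in
    approx.(i) <- h0 *. x0 +. h1 *. x1 +. h2 *. x2 +. h3 *. x3;
    detail.(i) <- h3 *. x0 -. h2 *. x1 +. h1 *. x2 -. h0 *. x3
  done;
  approx, detail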

Can convolution be implemented efficiently in Python?

> Actually I started to wonder if I could automate conversion of
> multi-loop lifting schemes [1] into one-pass loop.

See below.

> This could be done 
> by delaying computation of the "coming next" step just by few
> units, until values from the preceding step are ready to use.

Exactly.

Functional programming makes this easy. You just compose closures from
closures instead of arrays from arrays.
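
A tiny illustration on a hypothetical two-stage pipeline (not the wavelet
code itself): the array version allocates an intermediate array, while the
closure version composes the two stages and only materialises the result
once.

(* arrays from arrays: the intermediate array built by "scaled" is
   allocated, filled and then read straight back *)
let scaled a = Array.init (Array.length a) (fun i -> 2. *. a.(i))
let shifted a = Array.init (Array.length a) (fun i -> a.(i) +. 1.)
let two_pass a = shifted (scaled a)

(* closures from closures: the same pipeline fused into a single pass *)
let one_pass a =
  let scaled i = 2. *. a.(i) in
  let shifted i = scaled i +. 1. in
  Array.init (Array.length a) shifted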

I discussed exactly this technique in the context of the travelling salesman
problem in my book "OCaml for Scientists".

>> I used a more efficient approach because I could. Using a compiled
>> language gave me that choice. We should compare both approaches in both
>> languages.
> 
> I am getting a bit tired of this discussion. I am not arguing which
> approach is faster, because it very much depends on the use case.

Compiled code is virtually always faster.

> I 
> just ask you to do fair comparison using the original algorithm
> presented here instead of forcing your arguments with the
> not-so-relevant variant convenient for you.

This is the third time that I've presented both variants. Now I've presented
detailed timings as well. Where's the other Python implementation?

> Every language (or at least vast majority of them) that has loops and
> basic arithmetic gives you such choice, no matter compiled or not.

I am under the impression that it is prohibitively slow in Python.

> If the OCaml is so superior as you claim,

I'm not claiming that OCaml is superior. I took the original Matlab vs
Python language performance comparison and showed that C++, OCaml and F#
all outperform both Matlab and Python. I'm keen to see what advantages
Python has.

> please show me how to do the lifting scheme 
> (preferably in the
> http://groups.google.com/group/comp.lang.python/msg/a71a5bf00a88ad57
> form with features described in my previous post) so I could actually
> learn something new and maybe even start to consider it as a possible
> alternative to C-ish language family.

Here's a starter. Your Python translated directly to functional OCaml/F#
that simply constructs new arrays at every step:

let rec d4_aux cA = function
  | 1 -> [cA]
  | n ->
      let n = n/2 in
      let f = init n in
      let even = f (fun i -> cA.(2*i)) in
      let odd = f (fun i -> cA.(2*i + 1)) in
      let even = f (fun i -> even.(i) +. c1 *. odd.(i)) in
      let odd = f (function
                   | 0 -> odd.(0) -. c2 *. even.(0) -. c3 *. even.(n-1)
                   | i -> odd.(i) -. c2 *. even.(i) -. c3 *. even.(i-1)) in
      let even = f (function
                    | i when i=n-1 -> even.(i) -. odd.(0)
                    | i -> even.(i) -. odd.(i+1)) in
      let even = f (fun i -> c4 *. even.(i)) in
      let odd = f (fun i -> c5 *. odd.(i)) in
      odd :: d4_aux even n;;

Note the use of the "init" function, allocating and filling new arrays at
every step of the computation.
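
The "init" function and the constants c1..c5 are defined elsewhere in the
thread rather than shown here. As an assumption, definitions along these
lines (placed before the function above, using the standard D4 lifting
coefficients) would make the snippet self-contained:

let init = Array.init

(* standard D4 lifting coefficients -- assumed, not quoted from the post *)
let c1 = sqrt 3.
let c2 = sqrt 3. /. 4.
let c3 = (sqrt 3. -. 2.) /. 4.
let c4 = (sqrt 3. -. 1.) /. sqrt 2.
let c5 = (sqrt 3. +. 1.) /. sqrt 2.

(* a plausible entry point for a power-of-two length signal *)
let d4 a = d4_aux a (Array.length a)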

The advantages of the above code are not immediately apparent. However, it
is trivially deforested (i.e. multi-pass -> single-pass) by accumulating new
closures "even" and "odd" rather than new arrays:

let rec d4_aux cA = function
  | 1 -> [cA]
  | n ->
      let n = n/2 in
      let even i = cA.(2*i) in
      let odd i = cA.(2*i + 1) in
      let even i = even i +. c1 *. odd i in
      let odd = function
        | 0 -> odd 0 -. c2 *. even 0 -. c3 *. even (n-1)
        | i -> odd i -. c2 *. even i -. c3 *. even (i-1) in
      let even = function
        | i when i = n-1 -> even (i) -. odd (0)
        | i -> even (i) -. odd (i+1) in
      let even = init n (fun i -> c4 *. even i) in
      let odd = init n (fun i -> c5 *. odd i) in
      odd :: d4_aux even n;;

The code is almost identical but five of the seven O(n) array allocations
have been replaced with O(1) closure allocations. The change simply
replaces array lookup "a.(i)" with function invocation "a i".

On my machine, your original Python took 2.49s, the forested OCaml above
takes 1.34s and the deforested OCaml takes 1.16s. The deforesting
optimisation is only possible because OCaml is compiled. If you do the same
transformation in Python the program will become vastly slower because the
closures must be interpreted for each array element.

This is what I meant when I said that optimising for Python takes you in the
wrong direction (towards foresting). I think you're only using slice-based
rewriting because it is fast in Python, not because it is a good idea. So,
is it worth adding slicing to F#?

> As far as the DWT only is concerned, well, If I were to choose the most
> efficient approach (in both the CPU time and the development time
> terms) for Python I would use a ready extension instead of developing
> the wheel once again. Let me think, that should go something like this:
> 
> import pywt
> coeffs = pywt.wavedec(x, 'db2', 'per')

Is there a difference between this in Python and in any other language?

>>> ...
>> That is also an abstraction of get/set.
> 
> Oh my, are all these abstractions to be put into reality every time
> from a scratch?

This entails converting "a.(i)" to "a i", so it isn't a big deal.
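
To illustrate, the two views convert into one another with a one-line
adapter each way (the names here are just illustrative):

let as_fun a i = a.(i)               (* view an array as a lookup function *)
let as_array n f = Array.init n f    (* materialise a lookup function *)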

>> > axis and maximum decomposition level as input
>>
>> I don't know what that is.
> 
> When doing 1D operations on a multidimensional array it is often
> convenient to be able to specify along which axis these operations
> should be performed, for example:
> 
>>>> x = numpy.ones((2,4))
>>>> x.sum(axis=0)
> array([ 2.,  2.,  2.,  2.])
>>>> x.sum(axis=1)
> array([ 4.,  4.])

I see. This can also be accomplished using functional programming: you have
an array lookup that accepts an array of indices, and you build a closure
that fills in all of the dimensions except the one you're looping over.

The main difference is probably that I'd statically type the dimensionality
of the arrays. I'll have a play with this.
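
Something along these lines, as a rough sketch: a plain 2D float array and a
hypothetical "sum_along" helper standing in for the general n-dimensional
case, reproducing the numpy example above.

let sum_along axis (x : float array array) =
  let rows = Array.length x and cols = Array.length x.(0) in
  (* the lookup closure fixes every dimension except the one we loop over *)
  let get =
    if axis = 0 then (fun o i -> x.(i).(o)) else (fun o i -> x.(o).(i)) in
  let outer, inner = if axis = 0 then cols, rows else rows, cols in
  Array.init outer (fun o ->
    let s = ref 0. in
    for i = 0 to inner - 1 do s := !s +. get o i done;
    !s)

(* for a 2x4 array of ones, sum_along 0 x gives [|2.; 2.; 2.; 2.|] and
   sum_along 1 x gives [|4.; 4.|], matching the numpy session above *)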

> The maximum level denotes how many decomposition steps should be
> performed for multilevel transform. Just like described on
> http://www.mathworks.com/access/helpdesk/help/toolbox/wavelet/wavedec.html

What about a lazy list?
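
A minimal sketch of that idea, assuming a hypothetical "step" function that
performs one decomposition level and returns (approximation, detail); the
caller then forces only as many levels as it needs:

type 'a llist = Nil | Cons of 'a * 'a llist Lazy.t

let rec levels step cA =
  if Array.length cA <= 1 then Cons (cA, lazy Nil)
  else
    let approx, detail = step cA in
    (* detail coefficients for this level now; deeper levels on demand *)
    Cons (detail, lazy (levels step approx))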

>> > with transform coefficients for every level of
>> > decomposition. Can do?
>>
>> Sure. Sounds like a lot of work though.
> 
> See. It's for sure pretty much work to create a general solution in C.

I looked at some of the sophisticated wavelet libraries when I did my PhD on
wavelets and they were seriously complicated. Lisp might also be worth
considering because you could easily compile efficient implementations of
transforms that you were interested in.

> It is a lot easier to focus on the real problem when using Python (and
> NumPy or other package from a wide range of available extensions) and
> save yourself getting much into implementation details while still
> maintaining reasonable performance.

I'm not sure, but you might find it better to factor what you're doing using
functional programming constructs. The results will probably be almost as
easy to use whilst being easier to optimise, e.g. by deforesting. If you're
a researcher then I'm sure there's some new work to be done in this area...

> That is one of the most important 
> reasons I prefer to use high-level language for most of my tasks. I
> also try to avoid doing premature optimizations that could spare some
> CPU cycles, just to see in a few moments that the performance gain is
> completely swallowed by I/O overhead on some remote data access point.

Absolutely. Good factorisation of your code will let you do that, as will
leveraging existing libraries as you already are. You will probably miss
numpy if you try OCaml, though. I'm keen to hear what you'd miss about
numpy, as we can add it to F#. :-)

> For the rest of intensive number crunching, if there is no specialized
> extension available yet, I still prefer to use the Python/Pyrex/C or
> similar combo to get solution that can be easily integrated with other
> parts of a bigger system (like networking, databases, reports, web,
> etc. - wow, how many things the transition from C++ and Java few years
> ago made just so accessible) or just used to play in the interactive
> mode.

That makes perfect sense. I think F# could become a better Python but I need
to understand the relative benefits offered by Python first.

>> I can't get that to work in Python. What do I need to do?
> 
> Type exactly as presented above, starting from the very first line. The
> `from module import ...` imports chosen identifiers into current
> namespace, so you don't have to prefix them with module name.

Sorry, I mistook the first line for English. :-)

>> Well, OCaml does ok on the sudoku but K does awfully on the ray tracer.
> 
> That's pretty natural. Different languages are designed to solve
> different tasks. I hope your point is not to convince people to use
> OCaml to solve (or complicate) every possible problem.

Not at all. I haven't used OCaml for a while now because I'm working with F#
instead. Both languages have a lot of merits but the time is ripe to add
the features that you're touting from Python to F# because it is still
evolving. We've yet to consider Scheme/Lisp for this task...

>> http://caml.inria.fr/pub/docs/manual-ocaml/libref/Bigarray.html
> 
> Yes, I found this before but it is not very helpful to start with. Any
> doc with basic use cases and examples would be much better. Think of it
> as a possible restraint for people trying to learn the language.

I'll try to find one.

>> You probably want to use F# though.
> 
> On the other hand, I wonder how this integrates with other .net
> languages like C# or IronPython.

Seamlessly. C# is very easy (look at my examples, which use C# interfaces
from .NET). FFI to C is also very easy. There is a nice example of
interfacing to LAPACK in the F# distro.

-- 
Dr Jon D Harrop, Flying Frog Consultancy
Objective CAML for Scientists
http://www.ffconsultancy.com/products/ocaml_for_scientists/index.html?usenet


