[Numpy-discussion] Why is numpy.abs so much slower on complex64 than complex128 under windows 32-bit?

Francesc Alted francesc at continuum.io
Tue Apr 10 11:36:56 EDT 2012


On 4/10/12 6:44 AM, Henry Gomersall wrote:
> Here is the body of a post I made on stackoverflow, but it seems to be 
> a non-obvious issue. I was hoping someone here might be able to shed 
> light on it...
>
> On my 32-bit Windows Vista machine I notice a significant (5x) 
> slowdown when taking the absolute values of a fairly large 
> numpy.complex64 array when compared to a numpy.complex128 array.
>
> >>> import numpy
> >>> a = numpy.random.randn(256, 2048) + 1j*numpy.random.randn(256, 2048)
> >>> b = numpy.complex64(a)
> >>> timeit c = numpy.float32(numpy.abs(a))
> 10 loops, best of 3: 27.5 ms per loop
> >>> timeit c = numpy.abs(b)
> 1 loops, best of 3: 143 ms per loop
>
> Obviously, the outputs in both cases are the same (to within the 
> working precision).
>
> I do not notice the same effect on my Ubuntu 64-bit machine (indeed, 
> as one might expect, the double precision array operation is a bit 
> slower).
>
> Is there a rational explanation for this?
>
> Is this something that is common to all Windows machines?
>

I cannot tell for sure, but it looks like the Windows version of NumPy 
is casting complex64 to complex128 internally.  I'm guessing here, but 
numexpr lacks the complex64 type, so it has to do the upcast internally, 
and I'm seeing roughly the same slowdown:

In [6]: timeit numpy.abs(a)
100 loops, best of 3: 10.7 ms per loop

In [7]: timeit numpy.abs(b)
100 loops, best of 3: 8.51 ms per loop

In [8]: timeit numexpr.evaluate("abs(a)")
100 loops, best of 3: 1.67 ms per loop

In [9]: timeit numexpr.evaluate("abs(b)")
100 loops, best of 3: 4.96 ms per loop

In my case I'm seeing only a 3x slowdown, but that is because numexpr 
does not re-cast the outcome to complex64, whereas the Windows build 
might be doing so.  Just to make sure, can you run this:

In [10]: timeit c = numpy.complex64(numpy.abs(numpy.complex128(b)))
100 loops, best of 3: 12.3 ms per loop

In [11]: timeit c = numpy.abs(b)
100 loops, best of 3: 8.45 ms per loop

on your Windows box and see whether they yield similar results?
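
If it is easier, a self-contained script along these lines should run 
the same comparison with the standard timeit module.  This is only a 
rough sketch: the array shape is taken from your example and the loop 
count is arbitrary, so treat the exact figures as indicative only:

import timeit

import numpy

a = numpy.random.randn(256, 2048) + 1j * numpy.random.randn(256, 2048)
b = numpy.complex64(a)

def upcast_abs():
    # explicit complex64 -> complex128 -> abs -> back to complex64
    return numpy.complex64(numpy.abs(numpy.complex128(b)))

def direct_abs():
    # abs computed directly on the complex64 array
    return numpy.abs(b)

number = 100
for label, func in [("upcast round-trip", upcast_abs),
                    ("direct abs(b)    ", direct_abs)]:
    best = min(timeit.repeat(func, number=number, repeat=3)) / number
    print("%s: %.2f ms per loop" % (label, best * 1e3))

If the two figures come out close together on Windows, that would 
support the idea that the complex64 path is going through an internal 
upcast.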

> In a related note of confusion, the times above are notably (and 
> consistently) different (shorter) from those I get doing a naive `st = 
> time.time(); numpy.abs(a); print time.time()-st`. Is this to be expected?
>

This happens a lot, yes, especially when your code is memory-bottlenecked 
(a very common situation).  The explanation is simple: when your dataset 
is small enough to fit in the CPU cache, the first pass of the timing 
loop brings the whole working set into cache, so subsequent evaluations 
of the computation do not have to fetch data from main memory, and by 
the time you run the loop 10 times or more you are discarding any memory 
effect.  However, when you run the computation only once, you are 
measuring the memory fetch time as well (which is often much more 
realistic).
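
To see the difference, a rough sketch like this contrasts the two ways 
of timing (the exact numbers will of course depend on your machine and 
cache sizes):

import time
import timeit

import numpy

a = numpy.random.randn(256, 2048) + 1j * numpy.random.randn(256, 2048)

# One-shot measurement: the working set still has to come from main
# memory, so the memory fetch time is included.
st = time.time()
numpy.abs(a)
print("single run: %.2f ms" % (1e3 * (time.time() - st)))

# %timeit-style measurement: many runs, best of 3.  After the first pass
# whatever data fits in the CPU cache stays there, so the reported figure
# can be noticeably lower than the single-run one.
number = 10
best = min(timeit.repeat(lambda: numpy.abs(a), number=number, repeat=3)) / number
print("best of repeated runs: %.2f ms" % (1e3 * best))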

-- 
Francesc Alted
