[Numpy-discussion] numpy ufuncs and COREPY - any info?

Thu May 28 02:44:48 EDT 2009

On Tuesday 26 May 2009 15:14:39 Andrew Friedley wrote:
> David Cournapeau wrote:
> > Francesc Alted wrote:
> >> Well, it is Andrew who should demonstrate that his measurement is
> >> correct, but in principle, 4 cycles/item *should* be feasible when using
> >> 8 cores in parallel.
> >
> > But the 100x speed increase is for one core only unless I misread the
> > table. And I should have mentioned that 400 cycles/item for cos is on a
> > Pentium 4, which has dreadful performance (defective L1). On a much
> > better Core Duo Extreme something, I get 100 cycles/item (on a 64-bit
> > machine, though, and not the same compiler, although I guess the libm
> > version is what matters the most here).
> >
> > And let's not forget that there is the Python wrapping cost: by doing
> > everything in C, I got ~200 cycles/cos on the PIV, and ~60 cycles/cos on
> > the Core 2 Duo (for double), using the rdtsc performance counter. All
> > this for 1024 items in the array, so a very optimistic use case (everything
> > in L2 cache if not L1).
> >
> > This shows that python wrapping cost is not so high, making the 100x
> > claim a bit doubtful without more details on the way to measure speed.
>
> I appreciate all the discussion this is creating.  I wish I could work
> on this more right now; I have a big paper deadline coming up June 1
> that I need to focus on.
>
> Yes, you're reading the table right.  I should have been more clear on
> what my implementation is doing.  It's using SIMD, so performing 4
> cosines at a time where a libm cosine is only doing one.  Also I don't
> think libm transcendentals are known for being fast; I'm also likely
> gaining performance by using a well-optimized but less accurate
> approximation.  In fact a little more inspection shows my accuracy
> decreases as the input values increase; I will probably need to take a
> performance hit to fix this.
>
> I went and wrote code to use the libm fcos() routine instead of my cos
> code.  Performance is equivalent to numpy, plus an overhead:
>
> inp sizes      1024    10240   102400  1024000  3072000
> numpy        0.7282   9.6278 115.5976  993.5738 3017.3680
>
> lmcos    1   0.7594   9.7579 116.7135 1039.5783 3156.8371
> lmcos    2   0.5274   5.7885  61.8052  537.8451 1576.2057
> lmcos    4   0.5172   5.1240  40.5018  313.2487  791.9730
>
> corepy   1   0.0142   0.0880   0.9566    9.6162   28.4972
> corepy   2   0.0342   0.0754   0.6991    6.1647   15.3545
> corepy   4   0.0596   0.0963   0.5671    4.9499   13.8784
>
>
> The times I show are in milliseconds; the system used is a dual-socket
> dual-core 2 GHz Opteron.  I'm testing at the ufunc level, like this:
>
> import time
>
> def benchmark(fn, args):
>    avgtime = 0
>    fn(*args)  # warm-up call, not timed
>
>    for i in range(7):
>      t1 = time.time()
>      fn(*args)
>      t2 = time.time()
>
>      tm = t2 - t1
>      avgtime += tm
>
>    return avgtime / 7  # mean wall-clock time in seconds
>
> Where fn is a ufunc, e.g. numpy.cos.  So I prime the execution once, then
> do 7 timings and take the average.  I always appreciate suggestions on
> better ways to benchmark things.

No, that seems good enough.  But maybe you can present results in cycles/item.  
This is a relatively common unit and has the advantage that it does not depend 
on the frequency of your cores.
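
A minimal sketch of that conversion, assuming the 2 GHz clock quoted above for the Opteron and that each table entry is the wall-clock time in milliseconds for one ufunc call over the whole array. The function name cycles_per_item is hypothetical, not part of numpy or CorePy:

```python
def cycles_per_item(time_ms, n_items, clock_hz=2e9):
    """Convert a wall-clock time in milliseconds to CPU cycles per array item.

    clock_hz defaults to 2 GHz, the Opteron clock quoted in the thread; this
    is an assumption if the cores ran at a different frequency while timing.
    """
    return time_ms * 1e-3 * clock_hz / n_items

# Single-thread timings for 1024 items, copied from the table above.
for name, ms in [("numpy", 0.7282), ("lmcos", 0.7594), ("corepy", 0.0142)]:
    print("%-7s %8.1f cycles/item" % (name, cycles_per_item(ms, 1024)))
```

For the 2- and 4-thread rows this yields wall-clock cycles per item, not cycles summed over all the cores in use.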

-- 
Francesc Alted