[Numpy-discussion] numexpr with the new iterator

Tue Jan 11 06:58:27 EST 2011

On Tuesday 11 January 2011 06:45:28 Mark Wiebe wrote:
> On Mon, Jan 10, 2011 at 11:35 AM, Mark Wiebe <mwwiebe at gmail.com> wrote:
> > I'm a bit curious why the jump from 1 to 2 threads is scaling so
> > poorly.  Your timings have improvement factors of 1.85, 1.68, 1.64,
> > and 1.79.  Since the computation is trivial data parallelism, and I
> > believe it's still pretty far off the memory bandwidth limit, I
> > would expect a speedup of 1.95 or higher.
> 
> It looks like it is the memory bandwidth which is limiting the
> scalability.

Indeed, this is an increasingly important problem for modern computers.  
You may want to read:

http://www.pytables.org/docs/CISE-12-2-ScientificPro.pdf

;-)

> The slower operations scale much better than faster
> ones.  Below are some timings of successively faster operations. 
> When the operation is slow enough, it scales like I was expecting...
[clip]

Yeah, for another example on this with more threads, see:

http://code.google.com/p/numexpr/wiki/MultiThreadVM
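
To put numbers on the two-thread scaling quoted above, here is a minimal
sketch that turns each speed-up into a parallel efficiency and into the
non-scaling fraction implied by Amdahl's law.  Treating memory-bandwidth
contention as a "serial fraction" is only a simplifying assumption, and the
helper names are made up for this illustration:

```python
# Two-thread speed-ups quoted earlier in this thread.
speedups = [1.85, 1.68, 1.64, 1.79]
n_threads = 2


def parallel_efficiency(speedup, n):
    """Fraction of the ideal n-fold speed-up actually obtained."""
    return speedup / n


def amdahl_serial_fraction(speedup, n):
    """Invert Amdahl's law, S = 1 / (f + (1 - f) / n), for f.

    Assumption: everything that does not scale (here, most likely
    memory-bandwidth contention) is lumped into one fraction f.
    """
    return (n / speedup - 1.0) / (n - 1.0)


if __name__ == "__main__":
    for s in speedups + [1.95]:  # 1.95 is the speed-up Mark expected
        eff = parallel_efficiency(s, n_threads)
        f = amdahl_serial_fraction(s, n_threads)
        print("speed-up %.2f: efficiency %.1f%%, non-scaling fraction %.1f%%"
              % (s, 100 * eff, 100 * f))
```

Under that model, even a modest non-scaling fraction is enough to keep the
two-thread speed-up well below 1.95.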

OTOH, I was curious about the performance of the new iterator with 
Intel's VML, but it seems to work decently too:

$ python bench/vml_timing.py (original numexpr, *no* VML support)
*************** Numexpr vs NumPy speed-ups *******************
Contiguous case:         1.72 (mean), 0.92 (min), 3.07 (max)
Strided case:            2.1 (mean), 0.98 (min), 3.52 (max)
Unaligned case:          2.35 (mean), 1.35 (min), 3.31 (max)

$ python bench/vml_timing.py  (original numexpr, VML support)
*************** Numexpr vs NumPy speed-ups *******************
Contiguous case:         3.83 (mean), 1.1 (min), 10.19 (max)
Strided case:            3.21 (mean), 0.98 (min), 7.45 (max)
Unaligned case:          3.6 (mean), 1.47 (min), 7.87 (max)

$ python bench/vml_timing.py (new iter numexpr, VML support)
*************** Numexpr vs NumPy speed-ups *******************
Contiguous case:         3.56 (mean), 1.12 (min), 7.38 (max)
Strided case:            2.37 (mean), 0.09 (min), 7.63 (max)
Unaligned case:          3.56 (mean), 2.08 (min), 5.88 (max)

However, there are a couple of quirks here.  1) The original Numexpr is 
generally faster than the iter version.  2) The strided case is 
considerably worse for the iter version.  I've isolated the tests that 
perform worse for the iter version, and here are a couple of samples:

*************** Expression: exp(f3)
                    numpy: 0.0135
            numpy strided: 0.0144
          numpy unaligned: 0.0200
                  numexpr: 0.0020 Speed-up of numexpr over numpy: 6.6584
          numexpr strided: 0.1495 Speed-up of numexpr over numpy: 0.0962
        numexpr unaligned: 0.0049 Speed-up of numexpr over numpy: 4.0859

*************** Expression: sin(f3)>cos(f4)
                    numpy: 0.0291
            numpy strided: 0.0366
          numpy unaligned: 0.0407
                  numexpr: 0.0166 Speed-up of numexpr over numpy: 1.7518
          numexpr strided: 0.1551 Speed-up of numexpr over numpy: 0.2361
        numexpr unaligned: 0.0175 Speed-up of numexpr over numpy: 2.3246
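
As a quick cross-check of the two samples above, the following sketch
recomputes the speed-ups from the printed timings and also derives how much
slower the strided numexpr run is than the contiguous one.  The dictionary
layout and function names are just illustrative, and the recomputed ratios
differ slightly from the printed ones, presumably because the timings shown
are rounded to four decimals:

```python
# Timings in seconds, copied from the two samples above.
samples = {
    "exp(f3)": {
        "numpy": 0.0135, "numpy strided": 0.0144, "numpy unaligned": 0.0200,
        "numexpr": 0.0020, "numexpr strided": 0.1495,
        "numexpr unaligned": 0.0049,
    },
    "sin(f3)>cos(f4)": {
        "numpy": 0.0291, "numpy strided": 0.0366, "numpy unaligned": 0.0407,
        "numexpr": 0.0166, "numexpr strided": 0.1551,
        "numexpr unaligned": 0.0175,
    },
}


def speedup(timings, layout=""):
    """Speed-up of numexpr over numpy for a layout ('', 'strided', ...)."""
    suffix = (" " + layout) if layout else ""
    return timings["numpy" + suffix] / timings["numexpr" + suffix]


def strided_penalty(timings):
    """How many times slower numexpr is on strided vs. contiguous input."""
    return timings["numexpr strided"] / timings["numexpr"]


if __name__ == "__main__":
    for expr, timings in samples.items():
        print("%s: strided speed-up %.4f, strided/contiguous penalty %.1fx"
              % (expr, speedup(timings, "strided"), strided_penalty(timings)))
```

The penalty ratio is the interesting figure here: in both samples the
strided numexpr run is many times slower than the contiguous one, while
NumPy itself barely changes between the two layouts.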

Maybe you can shed some light on what's going on here (shall we discuss 
this off-list so as not to bore people too much?).

-- 
Francesc Alted