[Numpy-discussion] numexpr with the new iterator

Francesc Alted faltet at pytables.org
Mon Jan 10 06:55:16 EST 2011


On Monday 10 January 2011 11:05:27, Francesc Alted wrote:
> Also, I'd like to try out the new thread scheduling that you
> suggested to me privately (i.e. T0T1T0T1...  vs T0T0...T1T1...).

I've just implemented the new partition scheme in numexpr 
(T0T0...T1T1..., the original being T0T1T0T1...).  I'm attaching the 
patch for this.  The results are a bit confusing.  For example, using 
the attached benchmark (poly.py), I get these results on a common 
dual-core, non-NUMA machine:

With the T0T1...T0T1... (original) scheme:

Computing: '((.25*x + .75)*x - 1.5)*x - 2' with 100000000 points
Using numpy:
*** Time elapsed: 3.497
Using numexpr:
*** Time elapsed for 1 threads: 1.279000
*** Time elapsed for 2 threads: 0.688000

With the T0T0...T1T1... (new) scheme:

Computing: '((.25*x + .75)*x - 1.5)*x - 2' with 100000000 points
Using numpy:
*** Time elapsed: 3.454
Using numexpr:
*** Time elapsed for 1 threads: 1.268000
*** Time elapsed for 2 threads: 0.754000

which is around 10% slower (2 threads) than the original partition.
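
For clarity, here is a tiny Python sketch (just an illustration; the 
actual scheduling in numexpr happens in the C virtual machine) of which 
blocks each thread visits under the two orderings.  The names n_blocks 
and n_threads are made up for the example:

# Illustration only: which block indices each thread processes.
def interleaved_schedule(n_blocks, n_threads):
    # Original T0T1T0T1... scheme: thread t takes blocks t, t+n_threads, ...
    return {t: list(range(t, n_blocks, n_threads))
            for t in range(n_threads)}

def contiguous_schedule(n_blocks, n_threads):
    # New T0T0...T1T1... scheme: thread t takes one contiguous run of blocks.
    per_thread = (n_blocks + n_threads - 1) // n_threads
    return {t: list(range(t * per_thread,
                          min((t + 1) * per_thread, n_blocks)))
            for t in range(n_threads)}

print(interleaved_schedule(8, 2))  # {0: [0, 2, 4, 6], 1: [1, 3, 5, 7]}
print(contiguous_schedule(8, 2))   # {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}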

The results are a bit different on a NUMA machine (8 physical cores, 16 
logical cores via hyper-threading):

With the T0T1...T0T1... (original) partition:

Computing: '((.25*x + .75)*x - 1.5)*x - 2' with 100000000 points
Using numpy:
*** Time elapsed: 3.005
Using numexpr:
*** Time elapsed for 1 threads: 1.109000
*** Time elapsed for 2 threads: 0.677000
*** Time elapsed for 3 threads: 0.496000
*** Time elapsed for 4 threads: 0.394000
*** Time elapsed for 5 threads: 0.324000
*** Time elapsed for 6 threads: 0.287000
*** Time elapsed for 7 threads: 0.247000
*** Time elapsed for 8 threads: 0.234000
*** Time elapsed for 9 threads: 0.242000
*** Time elapsed for 10 threads: 0.239000
*** Time elapsed for 11 threads: 0.241000
*** Time elapsed for 12 threads: 0.235000
*** Time elapsed for 13 threads: 0.226000
*** Time elapsed for 14 threads: 0.214000
*** Time elapsed for 15 threads: 0.235000
*** Time elapsed for 16 threads: 0.218000

With the T0T0...T1T1... (new) partition:

Computing: '((.25*x + .75)*x - 1.5)*x - 2' with 100000000 points
Using numpy:
*** Time elapsed: 3.003
Using numexpr:
*** Time elapsed for 1 threads: 1.106000
*** Time elapsed for 2 threads: 0.617000
*** Time elapsed for 3 threads: 0.442000
*** Time elapsed for 4 threads: 0.345000
*** Time elapsed for 5 threads: 0.296000
*** Time elapsed for 6 threads: 0.257000
*** Time elapsed for 7 threads: 0.237000
*** Time elapsed for 8 threads: 0.260000
*** Time elapsed for 9 threads: 0.245000
*** Time elapsed for 10 threads: 0.261000
*** Time elapsed for 11 threads: 0.238000
*** Time elapsed for 12 threads: 0.210000
*** Time elapsed for 13 threads: 0.218000
*** Time elapsed for 14 threads: 0.200000
*** Time elapsed for 15 threads: 0.235000
*** Time elapsed for 16 threads: 0.198000

In this case, the performance is similar, with perhaps a slight 
advantage for the new partition scheme, but I don't know whether it is 
worth making it the default (probably not, as this partition clearly 
performs worse on non-NUMA machines).  At any rate, both partitions get 
very close to the aggregate memory bandwidth of the NUMA machine 
(around 10 GB/s in the above case).
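
A quick back-of-the-envelope check of that figure (assuming x and the 
result are contiguous float64 arrays, so 8 bytes each per element; the 
temporaries should mostly stay in cache because of the blocking, so I'm 
not counting them):

# Rough aggregate-bandwidth estimate for the best 16-thread run above.
n = 100 * 1000 * 1000
bytes_moved = n * 8 * 2    # read x once, write the result once (float64)
best_time = 0.198          # fastest run above, in seconds
print(bytes_moved / best_time / 1e9)  # ~8 GB/s, same ballpark as ~10 GB/s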

In general, I don't think there is much point in using Intel's TBB in 
numexpr, because the existing implementation already hits the memory 
bandwidth limit pretty early (at around 10 threads in the latter 
example).
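
In case anyone wants to reproduce the thread sweep without fetching the 
attachment, the timing loop in poly.py is essentially along these lines 
(the attached script may differ in its details; set_num_threads() and 
evaluate() are the regular numexpr calls):

# Rough reconstruction of the poly.py timing loop; details may differ
# from the attached script.
from time import time
import numpy as np
import numexpr as ne

N = 100 * 1000 * 1000
x = np.linspace(-1, 1, N)                  # 100M float64 points

t0 = time()
y = ((.25*x + .75)*x - 1.5)*x - 2          # plain numpy version
print("Using numpy:")
print("*** Time elapsed: %.3f" % (time() - t0))

print("Using numexpr:")
for nthreads in range(1, 17):
    ne.set_num_threads(nthreads)
    t0 = time()
    y = ne.evaluate('((.25*x + .75)*x - 1.5)*x - 2')
    print("*** Time elapsed for %d threads: %f" % (nthreads, time() - t0))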

-- 
Francesc Alted
-------------- next part --------------
A non-text attachment was scrubbed...
Name: new_partition.diff
Type: text/x-patch
Size: 3778 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20110110/242223ba/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: poly.py
Type: text/x-python
Size: 1620 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20110110/242223ba/attachment.py>

