[SciPy-user] Benchmark data

Fri Dec 9 19:16:24 EST 2005

On Fri, 09 Dec 2005 13:01:41 -0700
Travis Oliphant <oliphant.travis at ieee.org> wrote:

> Gerard Vermeulen wrote:
> 
> >On Fri, 09 Dec 2005 03:14:49 -0700
> >Travis Oliphant <oliphant.travis at ieee.org> wrote:
> >
> >>I'd like people to try out scipy core in SVN.  I made improvements to the
> >>buffered ufunc section of code that I think will make a big difference
> >>in the recently published benchmarks. 
> >>
> >
> >Hi Travis,
> >
> >Indeed, it made a big difference: for big arrays, scipy is now the fastest on
> >some statements.
> >
> >Below are my benchmark results on my DIY python, see
> >http://www.scipy.org/mailinglists/mailman?fn=scipy-user/2005-December/006057.html
> >
> >On my system and for large arrays (>4096), numarray is still fastest, scipy moved
> >to second and Numeric is third.
> >Numeric is still fastest for small arrays, scipy is second, numarray is third.
> >
> Numeric will always be faster for small-enough arrays, I think, because 
> it doesn't have the ufunc overhead.   I just don't want it to be a lot 
> faster.   We can improve the limiting scalar case in scipy_core using 
> separate scalar math.  It looks like we are doing reasonably well.
>

Agreed, the tiny difference won't stop me from using scipy :-) 

> 
> >Invoking: python bench.py 12
> >Importing test to scipy
> >Importing base to scipy
> >Importing basic to scipy
> >Python 2.4.2 (#1, Dec  4 2005, 08:21:04) 
> >[GCC 3.4.3 (Mandrakelinux 10.2 3.4.3-7mdk)]
> >Optimization flags: -DNDEBUG -O3 -march=i686
> >CPU info: getNCPUs=2 has_mmx has_sse has_sse2 is_32bit is_Intel is_Pentium is_PentiumIV
> >Numeric-24.2
> >numarray-1.5.0
> >scipy-core-0.8.1.1617
> >benchmark size = 12  (vectors of length 16777216)
> >label            Numeric       numarray     scipy.base
> >    1             0.4127        0.07423         0.3927
> >    2             0.2734         0.2321         0.3234
> >    3             0.1975         0.1821         0.2733
> >    4             0.8747         0.5371         0.5588
> >    5             0.2896         0.2342         0.2737
> >    6             0.2066         0.1731         0.2718
> >    7             0.8761         0.6286         0.5524
> >    8             0.6546         0.4556         0.4533
> >    9              9.488          7.566          8.717
> >   10              9.506          8.064          8.745
> >   11              7.879          6.301          7.305
> >TOTAL              30.66          24.45          27.87
> >
> 
> As mentioned before, it looks like the optimizer is doing something nice 
> on your system.   One issue is arange, which could definitely be made 
> faster by having different "fillers" for different types.   I'm still 
> astonished by how markedly your numbers differ from those others 
> have shown.  Is this all the -O3 optimization kicking in?
> 

Below is data for a build without optimization options (OPT='-g'),
for a plain configure invocation (configure sets -O3 by itself), and
for a debug build. The data above shows the benefit of adding
-march=i686 (less benefit than I claimed in one of my previous mails).

The compiler flags do not make a big overall difference on my machine,
but a debug build is bad for numarray (easy to explain, since numarray does
more in plain Python, so it will suffer more from Python's debug overhead).

>
> The other issue is the sin and cosine functions.  They don't have their 
> own inner loops.  They call a generic inner loop with a 
> "function-pointer" data.    Perhaps the optimizer can't do as much with 
> that, or it needs to be written with an optimizer in mind.
>

I understand that gcc uses inline assembler for simple math
functions, so it is certainly something to look into.

> 
> Ultimately, though, I'd like to see some of the inner loops take 
> advantage of SSE (and equivalent) instructions if the number of 
> iterations is large enough.    So, yes, I think we could get faster.  
> But, I'd first like to get more data from more machines and compiler 
> flags to determine where the slowness is really coming from.   It might 
> be good, for example, to break up one of lines 9, 10, and 11 so that at 
> least one sin and cos calculation is done alone.
>

I agree that more data is necessary, and I remind everybody that my data is
for an Intel CPU and that all the other data (David Cooke, Arnd Backer and you)
is for AMD CPUs.

It may be worthwhile to generalize the benchmark program so that it
reads statements from a file and does timed calls to eval(statement).
I am going to play with this idea this weekend.
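
A minimal sketch of that idea, not the actual bench.py: it reads one
expression per line, compiles each once, and times eval() on it. The names
load_statements and time_statements, the file layout, and the statements
themselves are assumptions made for illustration. It uses only the standard
library of a current Python (time.perf_counter does not exist in Python 2.4),
and plain lists stand in for arrays. The sample statements time sin and cos
separately as well as together, in the spirit of the suggestion quoted above.

```python
import io
import math
import time


def load_statements(lines):
    """Return the non-empty, non-comment statements from an iterable of lines."""
    statements = []
    for line in lines:
        text = line.strip()
        if text and not text.startswith("#"):
            statements.append(text)
    return statements


def time_statements(statements, namespace, repeat=3):
    """Time eval() of each statement and keep the best of `repeat` runs."""
    results = []
    for text in statements:
        # Compile once so that parsing is not part of the timed region.
        code = compile(text, "<benchmark>", "eval")
        best = float("inf")
        for _ in range(repeat):
            start = time.perf_counter()
            eval(code, namespace)
            best = min(best, time.perf_counter() - start)
        results.append((text, best))
    return results


# Stand-in for a statements file; a real run would pass an open file instead.
example_file = io.StringIO(
    "# one expression per line\n"
    "[math.sin(v) for v in x]\n"
    "[math.cos(v) for v in x]\n"
    "\n"
    "[math.sin(v) + math.cos(v) for v in x]\n"
)

# Names that the statements are allowed to refer to.
namespace = {"math": math, "x": [i * 0.001 for i in range(10000)]}

statements = load_statements(example_file)
timings = time_statements(statements, namespace)

for label, (text, seconds) in enumerate(timings, start=1):
    print("%5d  %10.4g  %s" % (label, seconds, text))
```

Keeping the statements in a file means a compound expression can be split
into its parts without touching the timing code. Note that eval() executes
whatever the file contains, so the file has to be trusted.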

Travis, I really appreciate how seriously you take this.

Gerard

PS: the additional benchmark data:

# NO OPTIMIZATION, SET -g
[packer at titan BUILD]$ python bench.py 12
Importing test to scipy
Importing base to scipy
Importing basic to scipy
Python 2.4.2 (#1, Dec  9 2005, 23:29:32)
[GCC 3.4.3 (Mandrakelinux 10.2 3.4.3-7mdk)]
Optimization flags: -DNDEBUG -g
CPU info: getNCPUs=2 has_mmx has_sse has_sse2 is_32bit is_Intel is_Pentium is_PentiumIV
Numeric-24.2
numarray-1.5.0
scipy-core-0.8.1.1617
benchmark size = 12  (vectors of length 16777216)
label            Numeric       numarray     scipy.base
    1             0.4933         0.1256         0.4204
    2             0.3355         0.3553         0.4266
    3             0.2704         0.2815         0.3545
    4             0.9105         0.6438         0.6785
    5             0.3868         0.3442         0.3511
    6             0.2617         0.2815         0.3553
    7             0.9159          0.707         0.6803
    8             0.6202         0.4303         0.4254
    9              11.22          9.597           10.7
   10              11.11          9.906          10.67
   11              9.129          7.836          8.837
TOTAL              35.66          30.51           33.9

# LET configure DECIDE BY ITSELF
[packer at titan BUILD]$ python bench.py 12
Importing test to scipy
Importing base to scipy
Importing basic to scipy
Python 2.4.2 (#1, Dec  9 2005, 23:38:46)
[GCC 3.4.3 (Mandrakelinux 10.2 3.4.3-7mdk)]
Optimization flags: -DNDEBUG -g -O3 -Wall -Wstrict-prototypes
CPU info: getNCPUs=2 has_mmx has_sse has_sse2 is_32bit is_Intel is_Pentium is_PentiumIV
Numeric-24.2
numarray-1.5.0
scipy-core-0.8.1.1617
benchmark size = 12  (vectors of length 16777216)
label            Numeric       numarray     scipy.base
    1             0.4525        0.07366          0.413
    2             0.2674         0.2354         0.3261
    3             0.2022         0.1853         0.2755
    4             0.8649         0.5381         0.5522
    5             0.2831         0.2361         0.2681
    6             0.1919         0.1747         0.2755
    7             0.8809         0.6236         0.5593
    8              0.629         0.4348         0.4341
    9              11.12          9.065          10.34
   10              11.15           9.49          10.37
   11              9.179          7.484          8.599
TOTAL              35.23          28.54          32.41

# DEBUG BUILD, NO OPTIMIZATION
[packer at titan BUILD]$ python bench.py 12
Importing test to scipy
Importing base to scipy
Importing basic to scipy
Python 2.4.2 (#1, Dec  9 2005, 23:48:07)
[GCC 3.4.3 (Mandrakelinux 10.2 3.4.3-7mdk)]
Optimization flags: -g -Wall -Wstrict-prototypes
CPU info: getNCPUs=2 has_mmx has_sse has_sse2 is_32bit is_Intel is_Pentium is_PentiumIV
Numeric-24.2
numarray-1.5.0
scipy-core-0.8.1.1617
benchmark size = 12  (vectors of length 16777216)
label            Numeric       numarray     scipy.base
    1             0.5178         0.1394         0.4421
    2             0.3371         0.3589         0.4191
    3             0.2615         0.2919          0.364
    4             0.9209          0.801         0.6696
    5             0.3899         0.3668          0.361
    6             0.2619         0.2804         0.3552
    7             0.9191         0.9445         0.6808
    8             0.6171         0.6379         0.4235
    9              11.09          11.31          10.69
   10              11.11          11.79          10.69
   11              9.125          9.256          8.836
TOTAL              35.56          36.18          33.94
[47538 refs]
[packer at titan BUILD]$
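
To make the comparison between builds easier to read, the TOTAL rows of the
four runs in this message can be expressed relative to one of them. This is
only a sketch: the numbers are copied from the tables in this message, and
taking the '-DNDEBUG -g' build as the baseline is an arbitrary choice made
for illustration.

```python
# TOTAL rows copied from the four benchmark runs in this message (seconds).
packages = ("Numeric", "numarray", "scipy.base")
totals = {
    "-DNDEBUG -O3 -march=i686": (30.66, 24.45, 27.87),
    "-DNDEBUG -g": (35.66, 30.51, 33.90),
    "-DNDEBUG -g -O3 -Wall -Wstrict-prototypes": (35.23, 28.54, 32.41),
    "-g -Wall -Wstrict-prototypes": (35.56, 36.18, 33.94),
}

# Assumption for illustration: use the unoptimized non-debug build as baseline.
baseline_flags = "-DNDEBUG -g"
baseline = totals[baseline_flags]

# Ratio of each total to the baseline total, per package.
relative = {}
for flags, row in totals.items():
    relative[flags] = tuple(t / b for t, b in zip(row, baseline))

print("Totals relative to the '%s' build:" % baseline_flags)
for flags, ratios in relative.items():
    cells = "  ".join("%s=%.3f" % (name, r) for name, r in zip(packages, ratios))
    print("%-42s %s" % (flags, cells))
```

With these numbers, the debug build leaves the Numeric and scipy.base totals
within about 1% of the baseline, while the numarray total rises by roughly
19% (36.18 / 30.51).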