[SciPy-User] Proposal for a new data analysis toolbox

Thu Nov 25 04:30:03 EST 2010

On Wed, Nov 24, 2010 at 10:30 PM, Sebastian Haase <seb.haase at gmail.com> wrote:
> On Wed, Nov 24, 2010 at 8:57 PM, Keith Goodman <kwgoodman at gmail.com> wrote:
>> On Wed, Nov 24, 2010 at 11:32 AM, Sebastian Haase <seb.haase at gmail.com> wrote:
>>> On Wed, Nov 24, 2010 at 8:05 PM, Keith Goodman <kwgoodman at gmail.com> wrote:
>>>> Brief Sphinx doc of whatever it's called can be found here:
>>>> http://berkeleyanalytics.com/dsna
>>>
>>> I would like to throw in one of my favorite functions that I
>>> implemented years ago using (templated) SWIG:
>>>
>>> mmms() calculates min, max, mean and standard deviation in one pass.
>>> While, by using SWIG function templates, it can handle multiple
>>> dtypes efficiently (without copying data), I never even attempted to
>>> handle striding or axes...
>>> Similarly, mmm() (that is, minmaxmean()) might also be good to have,
>>> if one really needs to avoid the (little?!) extra time spent
>>> computing the sum of the squares (for the std. dev.).
>>>
>>> If you added this kind of function to the new toolbox, I would be happy
>>> to benchmark it against my venerable (simpler) SWIG version...
>>
>> What are your timings compared to say mean_1d_float64_axis0(arr)?
>
> Sorry, I don't have Cygwin set up yet -- I would need binaries. I have
> win32, win64, lin32 and lin64 platforms that I could use to test...
> (in other words, no Mac)
> -Sebastian
>

OK, apparently I don't even need Cython, because the ready-made C source
files are already on GitHub.
So here are some benchmarks from my quad-core 64-bit Linux machine (Python 2.5):

In [12]: ds.benchit(verbose=False)
Warning: invalid value encountered in divide
<snip about 100 repeats of this line>
DSNA performance benchmark
        DSNA  0.1.0dev
        Numpy 1.5.1rc1
        Scipy 0.8.0
        Speed is numpy (or scipy) time divided by dsna time
        NaN means all NaNs
   Speed   Test                  Shape        dtype    NaN?
   2.9189  nansum(a, axis=-1)    (500,500)    int64
   3.5088  nansum(a, axis=-1)    (10000,)     float64
   8.7537  nansum(a, axis=-1)    (500,500)    int32
   5.9544  nansum(a, axis=-1)    (500,500)    float64
   6.6559  nansum(a, axis=-1)    (10000,)     int32
   2.2585  nansum(a, axis=-1)    (10000,)     int64
   8.9303  nansum(a, axis=-1)    (500,500)    float64  NaN
   8.2773  nansum(a, axis=-1)    (10000,)     float64  NaN
   3.8125  nanmax(a, axis=-1)    (500,500)    int64
   9.7811  nanmax(a, axis=-1)    (10000,)     float64
   0.1229  nanmax(a, axis=-1)    (500,500)    int32
   9.6016  nanmax(a, axis=-1)    (500,500)    float64
   2.2976  nanmax(a, axis=-1)    (10000,)     int32
   3.0449  nanmax(a, axis=-1)    (10000,)     int64
  10.0007  nanmax(a, axis=-1)    (500,500)    float64  NaN
  10.3841  nanmax(a, axis=-1)    (10000,)     float64  NaN
   3.6968  nanmin(a, axis=-1)    (500,500)    int64
   8.1499  nanmin(a, axis=-1)    (10000,)     float64
   0.1206  nanmin(a, axis=-1)    (500,500)    int32
   8.0156  nanmin(a, axis=-1)    (500,500)    float64
   2.3175  nanmin(a, axis=-1)    (10000,)     int32
   3.0114  nanmin(a, axis=-1)    (10000,)     int64
   9.9174  nanmin(a, axis=-1)    (500,500)    float64  NaN
  10.4548  nanmin(a, axis=-1)    (10000,)     float64  NaN
  27.4548  nanmean(a, axis=-1)   (500,500)    int64
  13.9409  nanmean(a, axis=-1)   (10000,)     float64
  25.8452  nanmean(a, axis=-1)   (500,500)    int32
  14.3663  nanmean(a, axis=-1)   (500,500)    float64
  22.6811  nanmean(a, axis=-1)   (10000,)     int32
  23.1552  nanmean(a, axis=-1)   (10000,)     int64
  46.6657  nanmean(a, axis=-1)   (500,500)    float64  NaN
  22.7000  nanmean(a, axis=-1)   (10000,)     float64  NaN
   8.1311  nanstd(a, axis=-1)    (500,500)    int64
   8.7202  nanstd(a, axis=-1)    (10000,)     float64
   8.2082  nanstd(a, axis=-1)    (500,500)    int32
  11.7259  nanstd(a, axis=-1)    (500,500)    float64
   6.8491  nanstd(a, axis=-1)    (10000,)     int32
   6.2385  nanstd(a, axis=-1)    (10000,)     int64
  88.2903  nanstd(a, axis=-1)    (500,500)    float64  NaN
  25.8934  nanstd(a, axis=-1)    (10000,)     float64  NaN
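
For anyone who wants to reproduce a "Speed" column like the one above, here is a
minimal sketch of how such a ratio can be measured. The helper name speed_ratio
and its parameters are made up for illustration and are not part of dsna;
np.nansum and np.sum are only stand-ins for the "reference" and "candidate"
functions.

```python
import timeit

import numpy as np


def speed_ratio(reference, candidate, arr, number=100, repeat=3):
    """Return reference time divided by candidate time.

    A value above 1 means the candidate is faster than the reference.
    Hypothetical helper for illustration; not taken from dsna.
    """
    # Use the best of several repeats, as IPython's timeit does.
    t_ref = min(timeit.repeat(lambda: reference(arr), number=number, repeat=repeat))
    t_new = min(timeit.repeat(lambda: candidate(arr), number=number, repeat=repeat))
    return t_ref / t_new


if __name__ == "__main__":
    arr = np.arange(10000, dtype=np.float64)
    # Stand-in pair: np.nansum as the reference, np.sum as the candidate.
    print("speed ratio: %.4f" % speed_ratio(np.nansum, np.sum, arr))
```

The printed number depends entirely on the machine and library versions, so no
expected value is given here.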

In [15]: arr = np.arange(10000, dtype=np.float64)

In [17]: ds.mean(arr)
Out[17]: 4999.5

In [18]: timeit ds.mean(arr)
100000 loops, best of 3: 16.6 us per loop

In [30]: timeit ds.std(arr)
10000 loops, best of 3: 32.8 us per loop

In [32]: timeit ds.min(arr),ds.max(arr),ds.mean(arr)
10000 loops, best of 3: 43.5 us per loop

In [31]: timeit ds.min(arr),ds.max(arr),ds.mean(arr),ds.std(arr)
10000 loops, best of 3: 76.5 us per loop

In [19]: from Priithon import useful

In [20]: timeit useful.mm(arr)      # calls numpy reduce twice (see below), i.e. not optimized
10000 loops, best of 3: 90.6 us per loop

In [21]: timeit useful.mean(arr)               # my SWIG
100000 loops, best of 3: 14 us per loop

In [22]: timeit useful.mmm(arr)     # does both of the two above, i.e. not optimized
10000 loops, best of 3: 105 us per loop

In [23]: timeit useful.mmms(arr)    # my SWIG; compares to 76.5 us of 'ds' above (still OK I guess ... but more typing ;-))
10000 loops, best of 3: 36.2 us per loop

In [25]: useful.mm??
Type:           function
Base Class:     <type 'function'>
String Form:    <function mm at 0x2b0bb90>
Namespace:      Interactive
File:           /home/shaase/Priithon_27_lin64/Priithon/useful.py
Definition:     useful.mm(arr)
Source:
def mm(arr):
    """
    returns min,max of arr
    """

    arr = N.asarray(arr)
    return (N.minimum.reduce(arr.flat), N.maximum.reduce(arr.flat))

In [26]: useful.mmm??
Type:           function
Base Class:     <type 'function'>
String Form:    <function mmm at 0x2b0bc08>
Namespace:      Interactive
File:           /home/shaase/Priithon_27_lin64/Priithon/useful.py
Definition:     useful.mmm(arr)
Source:
def mmm(arr):
    """
    returns min,max,mean of arr
    """
    arr = _getGoodifiedArray(arr)
    #TODO: make nice for memmap
    m = S.mean(arr)
    return (N.minimum.reduce(arr.flat), N.maximum.reduce(arr.flat), m)

In [27]: useful.mmms??
Type:           function
Base Class:     <type 'function'>
String Form:    <function mmms at 0x2b0bc80>
Namespace:      Interactive
File:           /home/shaase/Priithon_27_lin64/Priithon/useful.py
Definition:     useful.mmms(arr)
Source:
def mmms(arr):
    """
    returns min,max,mean,stddev of arr
    """
    arr = _getGoodifiedArray(arr)
    #TODO: make nice for memmap
    mi,ma,me,st = S.mmms( arr )
    return (mi,ma,me,st)
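
S.mmms is a compiled routine whose source is not shown in this post. Purely as
an illustration of the single-pass idea (min, max, running sum and running sum
of squares in one loop), here is a slow pure-Python sketch. The name
mmms_one_pass is hypothetical, and this is not the SWIG implementation. The
sum-of-squares formula is the simple one mentioned earlier in the thread; it can
lose precision when the mean is large relative to the spread.

```python
import math

import numpy as np


def mmms_one_pass(arr):
    """Return (min, max, mean, std) of arr using a single loop over the data.

    Illustrative sketch only: population std (ddof=0), no NaN handling,
    no axis support.
    """
    flat = np.asarray(arr, dtype=np.float64).ravel()
    if flat.size == 0:
        raise ValueError("mmms_one_pass needs at least one element")

    lo = hi = float(flat[0])
    total = 0.0      # running sum, for the mean
    total_sq = 0.0   # running sum of squares, for the std. dev.
    for x in flat.tolist():
        if x < lo:
            lo = x
        elif x > hi:
            hi = x
        total += x
        total_sq += x * x

    n = flat.size
    mean = total / n
    # Clamp at zero to guard against tiny negative values from rounding.
    var = max(total_sq / n - mean * mean, 0.0)
    return lo, hi, mean, math.sqrt(var)


if __name__ == "__main__":
    arr = np.arange(10000, dtype=np.float64)
    print(mmms_one_pass(arr))
```

A compiled version of the same loop touches each element once, which is
presumably where the gain over four separate reductions comes from.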

In [28]: useful.mean??
Type:           function
Base Class:     <type 'function'>
String Form:    <function mean at 0x2b0b8c0>
Namespace:      Interactive
File:           /home/shaase/Priithon_27_lin64/Priithon/useful.py
Definition:     useful.mean(arr)
Source:
def mean(arr):
    arr = _getGoodifiedArray(arr)
    return S.mean( arr )  # CHECK if should use ns.mean

-----------------------------------------------
"S" is my C module; _getGoodifiedArray is a no-op if arr is already
contiguous, otherwise it simply copies the data.
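
The source of _getGoodifiedArray is not included here. A rough NumPy equivalent
of the behaviour described (leave contiguous input alone, copy otherwise) could
look like the sketch below. The function name as_contiguous is hypothetical;
np.ascontiguousarray is the standard NumPy call that provides this behaviour.

```python
import numpy as np


def as_contiguous(arr):
    """Return a C-contiguous array, copying only when necessary.

    Hypothetical stand-in for the behaviour described for
    _getGoodifiedArray; not the Priithon implementation.
    """
    return np.ascontiguousarray(arr)


if __name__ == "__main__":
    a = np.arange(10, dtype=np.float64)   # already contiguous
    b = a[::2]                            # strided view, not contiguous

    # Contiguous input: no copy, the result shares memory with a.
    print(np.shares_memory(a, as_contiguous(a)))
    # Strided input: the data is copied into a new contiguous buffer.
    print(np.shares_memory(b, as_contiguous(b)))
```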

Cheers,
Sebastian