[SciPy-User] Proposal for a new data analysis toolbox
Sebastian Haase
seb.haase at gmail.com
Thu Nov 25 04:30:03 EST 2010
On Wed, Nov 24, 2010 at 10:30 PM, Sebastian Haase <seb.haase at gmail.com> wrote:
> On Wed, Nov 24, 2010 at 8:57 PM, Keith Goodman <kwgoodman at gmail.com> wrote:
>> On Wed, Nov 24, 2010 at 11:32 AM, Sebastian Haase <seb.haase at gmail.com> wrote:
>>> On Wed, Nov 24, 2010 at 8:05 PM, Keith Goodman <kwgoodman at gmail.com> wrote:
>>>> Brief Sphinx doc of whatever it's called can be found here:
>>>> http://berkeleyanalytics.com/dsna
>>>
>>> I would like to throw in one of my favorite functions that I
>>> implemented years ago using (templated) SWIG:
>>>
>>> mmms() calculates min,max,mean and standard deviation in one run.
>>> While - by using SWIG function templates - it can handle multiple
>>> dtypes efficiently (without data copy) I never even attempted to
>>> handle striding or axes...
>>> Similarly, mmm() (that is, minmaxmean()) might also be good to have,
>>> if one really needs to not waste the (little?!) extra time of
>>> computing the sum of the squares (for the std.dev).
>>>
>>> If you added this kind of function to the new toolbox, I would be happy
>>> to benchmark it against my venerable (simpler) SWIG version...
>>
>> What are your timings compared to say mean_1d_float64_axis0(arr)?
>
> Sorry, I don't have Cygwin set up yet -- I would need binaries. I have
> a win32, a win64, lin32 and lin64 platform, I could use to test...
> (iow, no mac)
> -Sebastian
>
OK, apparently I don't even need Cython, because the ready-made C source
files are already on GitHub.
So here are some benchmarks from my quad-core 64-bit Linux machine (Python 2.5):
In [12]: ds.benchit(verbose=False)
Warning: invalid value encountered in divide
<snip about 100 repeats of this line>
DSNA performance benchmark
DSNA 0.1.0dev
Numpy 1.5.1rc1
Scipy 0.8.0
Speed is numpy (or scipy) time divided by dsna time
NaN means all NaNs
    Speed  Test                 Shape      dtype    NaN?
   2.9189  nansum(a, axis=-1)   (500,500)  int64
   3.5088  nansum(a, axis=-1)   (10000,)   float64
   8.7537  nansum(a, axis=-1)   (500,500)  int32
   5.9544  nansum(a, axis=-1)   (500,500)  float64
   6.6559  nansum(a, axis=-1)   (10000,)   int32
   2.2585  nansum(a, axis=-1)   (10000,)   int64
   8.9303  nansum(a, axis=-1)   (500,500)  float64  NaN
   8.2773  nansum(a, axis=-1)   (10000,)   float64  NaN
   3.8125  nanmax(a, axis=-1)   (500,500)  int64
   9.7811  nanmax(a, axis=-1)   (10000,)   float64
   0.1229  nanmax(a, axis=-1)   (500,500)  int32
   9.6016  nanmax(a, axis=-1)   (500,500)  float64
   2.2976  nanmax(a, axis=-1)   (10000,)   int32
   3.0449  nanmax(a, axis=-1)   (10000,)   int64
  10.0007  nanmax(a, axis=-1)   (500,500)  float64  NaN
  10.3841  nanmax(a, axis=-1)   (10000,)   float64  NaN
   3.6968  nanmin(a, axis=-1)   (500,500)  int64
   8.1499  nanmin(a, axis=-1)   (10000,)   float64
   0.1206  nanmin(a, axis=-1)   (500,500)  int32
   8.0156  nanmin(a, axis=-1)   (500,500)  float64
   2.3175  nanmin(a, axis=-1)   (10000,)   int32
   3.0114  nanmin(a, axis=-1)   (10000,)   int64
   9.9174  nanmin(a, axis=-1)   (500,500)  float64  NaN
  10.4548  nanmin(a, axis=-1)   (10000,)   float64  NaN
  27.4548  nanmean(a, axis=-1)  (500,500)  int64
  13.9409  nanmean(a, axis=-1)  (10000,)   float64
  25.8452  nanmean(a, axis=-1)  (500,500)  int32
  14.3663  nanmean(a, axis=-1)  (500,500)  float64
  22.6811  nanmean(a, axis=-1)  (10000,)   int32
  23.1552  nanmean(a, axis=-1)  (10000,)   int64
  46.6657  nanmean(a, axis=-1)  (500,500)  float64  NaN
  22.7000  nanmean(a, axis=-1)  (10000,)   float64  NaN
   8.1311  nanstd(a, axis=-1)   (500,500)  int64
   8.7202  nanstd(a, axis=-1)   (10000,)   float64
   8.2082  nanstd(a, axis=-1)   (500,500)  int32
  11.7259  nanstd(a, axis=-1)   (500,500)  float64
   6.8491  nanstd(a, axis=-1)   (10000,)   int32
   6.2385  nanstd(a, axis=-1)   (10000,)   int64
  88.2903  nanstd(a, axis=-1)   (500,500)  float64  NaN
  25.8934  nanstd(a, axis=-1)   (10000,)   float64  NaN
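The "Speed" column above is a plain ratio: NumPy (or SciPy) time divided by DSNA time, so values above 1 mean DSNA is faster. A minimal sketch of how such a ratio could be measured with the standard timeit module follows. The names nansum_alt and speed_ratio are made up for this illustration; nansum_alt is only a stand-in and is not the DSNA code, and this is not necessarily how ds.benchit works internally.

```python
import timeit

import numpy as np


def nansum_alt(a, axis=-1):
    """Hypothetical stand-in for an alternative nansum; NOT the DSNA code."""
    # Replace NaNs with zero, then do an ordinary sum along the axis.
    return np.where(np.isnan(a), 0.0, a).sum(axis=axis)


def speed_ratio(reference, candidate, a, repeat=3, number=100):
    """Return best reference time divided by best candidate time."""
    t_ref = min(timeit.repeat(lambda: reference(a, axis=-1),
                              repeat=repeat, number=number))
    t_new = min(timeit.repeat(lambda: candidate(a, axis=-1),
                              repeat=repeat, number=number))
    return t_ref / t_new


a = np.arange(10000, dtype=np.float64)
a[::7] = np.nan  # sprinkle in some NaNs

ratio = speed_ratio(np.nansum, nansum_alt, a)
print("speed ratio: %.2f" % ratio)
```

Taking the minimum over several repeats mirrors the "best of 3" convention that IPython's timeit reports further down.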
In [15]: arr = np.arange(10000, dtype=np.float64)
In [17]: ds.mean(arr)
Out[17]: 4999.5
In [18]: timeit ds.mean(arr)
100000 loops, best of 3: 16.6 us per loop
In [30]: timeit ds.std(arr)
10000 loops, best of 3: 32.8 us per loop
In [32]: timeit ds.min(arr),ds.max(arr),ds.mean(arr)
10000 loops, best of 3: 43.5 us per loop
In [31]: timeit ds.min(arr),ds.max(arr),ds.mean(arr),ds.std(arr)
10000 loops, best of 3: 76.5 us per loop
In [19]: from Priithon import useful
In [20]: timeit useful.mm(arr)  # calls numpy reduce twice (see below), i.e. not optimized
10000 loops, best of 3: 90.6 us per loop
In [21]: timeit useful.mean(arr) # my SWIG
100000 loops, best of 3: 14 us per loop
In [22]: timeit useful.mmm(arr)  # does both of the two above, i.e. not optimized
10000 loops, best of 3: 105 us per loop
In [23]: timeit useful.mmms(arr)  # my SWIG; compares to 76us of 'ds' above
                                  # (still OK I guess ... but more typing ;-))
10000 loops, best of 3: 36.2 us per loop
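The gain from mmms() comes from touching each element only once instead of making four separate passes. The idea can be sketched in plain Python as below. This is only an illustration and not the SWIG implementation: the name mmms_single_pass is made up, the variance uses Welford's running update rather than a raw sum of squares (it is numerically safer), and the standard deviation is assumed to be the population one (same as the np.std default).

```python
import math

import numpy as np


def mmms_single_pass(arr):
    """Return (min, max, mean, std) using one loop over the data.

    Illustrative only; a pure-Python loop is far slower than compiled code.
    """
    values = np.asarray(arr, dtype=np.float64).ravel()
    if values.size == 0:
        raise ValueError("need at least one element")

    lo = hi = float(values[0])
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    n = 0
    for x in values.tolist():
        n += 1
        if x < lo:
            lo = x
        if x > hi:
            hi = x
        # Welford's update for mean and squared deviations.
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)

    return lo, hi, mean, math.sqrt(m2 / n)


arr = np.arange(10000, dtype=np.float64)
mi, ma, me, st = mmms_single_pass(arr)
print(mi, ma, me, st)
```

The results can be checked against arr.min(), arr.max(), arr.mean() and arr.std(); only a compiled version of the same loop would be expected to show a speed advantage.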
In [25]: useful.mm??
Type: function
Base Class: <type 'function'>
String Form: <function mm at 0x2b0bb90>
Namespace: Interactive
File: /home/shaase/Priithon_27_lin64/Priithon/useful.py
Definition: useful.mm(arr)
Source:
def mm(arr):
    """
    returns min,max of arr
    """
    arr = N.asarray(arr)
    return (N.minimum.reduce(arr.flat), N.maximum.reduce(arr.flat))
In [26]: useful.mmm??
Type: function
Base Class: <type 'function'>
String Form: <function mmm at 0x2b0bc08>
Namespace: Interactive
File: /home/shaase/Priithon_27_lin64/Priithon/useful.py
Definition: useful.mmm(arr)
Source:
def mmm(arr):
    """
    returns min,max,mean of arr
    """
    arr = _getGoodifiedArray(arr)
    #TODO: make nice for memmap
    m = S.mean(arr)
    return (N.minimum.reduce(arr.flat), N.maximum.reduce(arr.flat), m)
In [27]: useful.mmms??
Type: function
Base Class: <type 'function'>
String Form: <function mmms at 0x2b0bc80>
Namespace: Interactive
File: /home/shaase/Priithon_27_lin64/Priithon/useful.py
Definition: useful.mmms(arr)
Source:
def mmms(arr):
    """
    returns min,max,mean,stddev of arr
    """
    arr = _getGoodifiedArray(arr)
    #TODO: make nice for memmap
    mi,ma,me,st = S.mmms( arr )
    return (mi,ma,me,st)
In [28]: useful.mean??
Type: function
Base Class: <type 'function'>
String Form: <function mean at 0x2b0b8c0>
Namespace: Interactive
File: /home/shaase/Priithon_27_lin64/Priithon/useful.py
Definition: useful.mean(arr)
Source:
def mean(arr):
    arr = _getGoodifiedArray(arr)
    return S.mean( arr ) # CHECK if should use ns.mean
-----------------------------------------------
"S" is my C module; _getGoodifiedArray is a no-op if arr is already
contiguous, otherwise it simply copies the data.
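The source of _getGoodifiedArray is not shown here, but the behaviour described (pass contiguous arrays through, copy everything else) is roughly what np.ascontiguousarray does. A small sketch under that assumption, with as_contiguous being a made-up name for illustration:

```python
import numpy as np


def as_contiguous(arr):
    """Hypothetical stand-in for _getGoodifiedArray (real code not shown)."""
    # np.ascontiguousarray returns a C-contiguous array and copies the
    # data only when the input is not already laid out that way.
    return np.ascontiguousarray(arr)


a = np.arange(10, dtype=np.float64)
strided = a[::2]  # every other element: not contiguous in memory

same = as_contiguous(a)
copied = as_contiguous(strided)

print(np.shares_memory(a, same))      # contiguous input: no copy expected
print(np.shares_memory(a, copied))    # strided input: data is copied
print(copied.flags["C_CONTIGUOUS"])
```

The copy in the strided case is the price for handing the C code a simple flat buffer, which fits the earlier remark that striding and axes were never handled.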
Cheers,
Sebastian