[SciPy-User] Proposal for a new data analysis toolbox

josef.pktd at gmail.com josef.pktd at gmail.com
Mon Nov 22 10:52:33 EST 2010


On Mon, Nov 22, 2010 at 10:35 AM, Keith Goodman <kwgoodman at gmail.com> wrote:
> This thread started on the numpy list:
> http://mail.scipy.org/pipermail/numpy-discussion/2010-November/053958.html
>
> I think we should narrow the focus of the package by only including
> functions that operate on numpy arrays. That would cut out date
> utilities, label indexing utilities, and binary operations with
> various join methods on the labels. It would leave us with three
> categories: faster versions of numpy/scipy nan functions, moving
> window statistics, and group functions.
>
> I suggest we add a fourth category: normalization.
>
> FASTER NUMPY/SCIPY NAN FUNCTIONS
>
> This work is already underway: http://github.com/kwgoodman/nanny
>
> The function signatures for these are easy: we copy numpy, scipy. (I
> am tempted to change nanstd from scipy's bias=False to ddof=0.)

scipy.stats.nanstd is supposed to switch to ddof, so don't copy
inconsistent signatures that are supposed to be deprecated.
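
To make the bias/ddof point concrete, here is a minimal sketch, assuming that bias=False means dividing by n-1 (i.e. ddof=1) and bias=True means dividing by n (ddof=0). The helper name nanstd_ddof is hypothetical and is not part of scipy or nanny; it only handles 1d input.

```python
import numpy as np

def nanstd_ddof(values, ddof=0):
    """Standard deviation of the non-NaN entries of a 1d sequence (hypothetical helper)."""
    arr = np.asarray(values, dtype=float)
    valid = arr[~np.isnan(arr)]       # drop missing values
    return np.std(valid, ddof=ddof)   # divisor is len(valid) - ddof

data = [1.0, np.nan, 2.0, 3.0, 4.0]
biased = nanstd_ddof(data, ddof=0)    # divides by n, assumed equivalent to bias=True
unbiased = nanstd_ddof(data, ddof=1)  # divides by n-1, assumed equivalent to bias=False
```

With a ddof keyword the caller picks the divisor explicitly, which is the numpy convention for std and var.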

I would like statistics (scipy.stats and statsmodels) to stick with
default axis=0.
I would be in favor of axis=None for nan extended versions of numpy
functions and axis=0 for stats functions as defaults, but since it
will be a standalone package with wider usage, I will be able to keep
track of axis=-1.
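
For anyone following along, the three axis defaults under discussion behave like this on a small 2d array. This is only an illustration using np.mean; it says nothing about what the new package will choose.

```python
import numpy as np

x = np.arange(6.0).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]

overall = np.mean(x)               # axis=None (numpy default): one value over all elements
per_column = np.mean(x, axis=0)    # axis=0: reduce over rows, one value per column
per_row = np.mean(x, axis=-1)      # axis=-1: reduce over the last axis, one value per row
```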

Josef

>
> I'd like to use a partial sort for nanmedian. Anyone interested in coding that?
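
A rough sketch of the partial-sort idea for the 1d case, using np.partition (available in recent numpy) rather than Cython. The name nanmedian_1d is hypothetical, and this is not the nanny implementation, just one way the selection step could look.

```python
import numpy as np

def nanmedian_1d(values):
    """Median of the non-NaN entries of a 1d sequence via partial sort (sketch)."""
    arr = np.asarray(values, dtype=float)
    valid = arr[~np.isnan(arr)]          # ignore missing values
    n = valid.size
    if n == 0:
        return np.nan                    # nothing to take a median of
    mid = n // 2
    if n % 2:
        # Odd count: only the middle element has to be in its sorted position.
        return float(np.partition(valid, mid)[mid])
    # Even count: the two middle elements must be in place; average them.
    part = np.partition(valid, [mid - 1, mid])
    return float(0.5 * (part[mid - 1] + part[mid]))
```

The partition places only the requested order statistics, so it avoids the full sort that a sort-then-index median would do.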
>
> dtype: int32, int64, float64 for now
> ndim: 1, 2, 3 (need some recursive magic for nd > 3; that's an open
> project for anyone)
>
> MOVING WINDOW STATISTICS
>
> I already have doc strings and unit tests
> (https://github.com/kwgoodman/la/blob/master/la/farray/mov.py). And I
> have a cython prototype that moves the window backwards so that the
> stats can be filled in place. (This assumes we make a copy of the data
> at the top of the function: arr = arr.astype(float))
>
> Proposed function signature: mov_sum(arr, window, axis=-1),
> mov_nansum(arr, window, axis=-1)
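
A pure-numpy sketch of the proposed mov_sum(arr, window, axis=-1) signature, to pin down the semantics. It assumes that the first window-1 positions along the axis are filled with NaN and that the output has the same shape as the input; both are assumptions here, and this is not the Cython prototype. It relies on sliding_window_view from recent numpy.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def mov_sum(arr, window, axis=-1):
    """Moving window sum along axis; leading window-1 entries are NaN (sketch)."""
    arr = np.asarray(arr, dtype=float)
    n = arr.shape[axis]
    if not 1 <= window <= n:
        raise ValueError("window must be between 1 and arr.shape[axis]")
    # The given axis shrinks to n - window + 1 and a trailing window axis is appended.
    windows = sliding_window_view(arr, window, axis=axis)
    sums = windows.sum(axis=-1)
    out = np.full(arr.shape, np.nan)
    index = [slice(None)] * arr.ndim
    index[axis] = slice(window - 1, None)   # positions where a full window exists
    out[tuple(index)] = sums
    return out
```

For example, mov_sum([1, 2, 3, 4, 5], 3) gives [nan, nan, 6, 9, 12].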
>
> If you don't like mov, then: move? roll?
>
> I think requesting a minimum number of non-nan elements in a window or
> else returning NaN is clever. But I do like the simple signature
> above.
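
One way the minimum-count idea could be layered on top of the simple signature is an optional keyword. In the sketch below the parameter name min_count and its default of 1 are hypothetical, not something proposed in the thread.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def mov_nansum(arr, window, axis=-1, min_count=1):
    """Moving nansum; windows with fewer than min_count non-NaN values give NaN (sketch)."""
    arr = np.asarray(arr, dtype=float)
    n = arr.shape[axis]
    if not 1 <= window <= n:
        raise ValueError("window must be between 1 and arr.shape[axis]")
    windows = sliding_window_view(arr, window, axis=axis)
    counts = (~np.isnan(windows)).sum(axis=-1)   # non-NaN elements per window
    sums = np.nansum(windows, axis=-1)           # NaNs are treated as zero
    out = np.full(arr.shape, np.nan)
    index = [slice(None)] * arr.ndim
    index[axis] = slice(window - 1, None)
    out[tuple(index)] = np.where(counts >= min_count, sums, np.nan)
    return out
```

Keeping the keyword optional leaves the call mov_nansum(arr, window) as simple as in the proposed signature.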
>
> Binary moving window functions: mov_nancorr(arr1, arr2, window, axis=-1), etc.
>
> Optional: moving window bootstrap estimate of error (std) of the
> moving statistic. So, what's the std of each estimate in the
> mov_median output? Too specialized?
>
> dtype: float64
> ndim: 1, 2, 3, recursive for nd > 3
>
> NORMALIZATION
>
> I already have nd versions of ranking, zscore, quantile, demean,
> demedian, etc. in larry. We should rename them to nanzscore, etc.
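
As an illustration of what a NaN-aware normalization function might look like, here is a minimal nanzscore sketch built on np.nanmean and np.nanstd. The defaults axis=0 and ddof=0 are assumptions made for the example; this is not the larry code.

```python
import numpy as np

def nanzscore(arr, axis=0, ddof=0):
    """Z-score along axis, ignoring NaNs; NaN inputs stay NaN in the output (sketch)."""
    arr = np.asarray(arr, dtype=float)
    mean = np.nanmean(arr, axis=axis, keepdims=True)
    std = np.nanstd(arr, axis=axis, ddof=ddof, keepdims=True)
    return (arr - mean) / std   # keepdims lets mean and std broadcast back over arr

x = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, 30.0]])
z = nanzscore(x, axis=0)
```

In the first column the mean is 2 and the std is 1, so the result there is -1, NaN, 1.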
>
> ranking and quantile could use some cython love.
>
> I don't know, should we cut this category?
>
> GROUP FUNCTIONS
>
> Input: array, sequence of labels such as a list, axis.
>
> For an array of shape (n,m), axis=0, and a list of n labels with d
> distinct values, group_nanmean would return a (d,m) array. I'd also
> like a groupfilter_nanmean which would return a (n,m) array and would
> have an additional, optional input: exclude_self=False.
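
A sketch of the group_nanmean shape contract described above: an (n,m) array with n labels and d distinct values gives a (d,m) result for axis=0. Returning the sorted distinct labels alongside the means is an assumption made here so the rows of the output can be identified; the thread does not specify a return value beyond the array.

```python
import numpy as np

def group_nanmean(arr, labels, axis=0):
    """Mean of each label group along axis, ignoring NaNs (sketch)."""
    arr = np.asarray(arr, dtype=float)
    labels = np.asarray(labels)
    if labels.ndim != 1 or labels.size != arr.shape[axis]:
        raise ValueError("need one label per element along axis")
    a = np.moveaxis(arr, axis, 0)          # put the grouped axis first
    groups = np.unique(labels)             # sorted distinct labels, length d
    out = np.empty((groups.size,) + a.shape[1:])
    for i, g in enumerate(groups):
        rows = a[labels == g]              # all slices that carry label g
        count = (~np.isnan(rows)).sum(axis=0)
        total = np.nansum(rows, axis=0)
        # Groups with no valid data give NaN instead of dividing by zero.
        out[i] = np.where(count > 0, total / np.maximum(count, 1), np.nan)
    return groups, np.moveaxis(out, 0, axis)

x = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [5.0, 6.0],
              [np.nan, np.nan]])
groups, means = group_nanmean(x, ["a", "b", "a", "b"], axis=0)
```

Here n=4, m=2 and d=2, so means has shape (2, 2): the "a" row is [3, 4] and the "b" row is [3, NaN].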
>
> NAME
>
> What should we call the package?
>
> Numa, numerical analysis with numpy arrays
> Dana, data analysis with numpy arrays
>
> import dana as da     (da=data analysis)
>
> ARE YOU CRAZY?
>
> If you read this far, you are crazy and would be a good fit for this project.