[SciPy-User] Proposal for a new data analysis toolbox

Mon Nov 22 10:35:21 EST 2010

This thread started on the numpy list:
http://mail.scipy.org/pipermail/numpy-discussion/2010-November/053958.html

I think we should narrow the focus of the package by only including
functions that operate on numpy arrays. That would cut out date
utilities, label indexing utilities, and binary operations with
various join methods on the labels. It would leave us with three
categories: faster versions of numpy/scipy nan functions, moving
window statistics, and group functions.

I suggest we add a fourth category: normalization.

FASTER NUMPY/SCIPY NAN FUNCTIONS

This work is already underway: http://github.com/kwgoodman/nanny

The function signatures for these are easy: we copy numpy, scipy. (I
am tempted to change nanstd from scipy's bias=False to ddof=0.)

I'd like to use a partial sort for nanmedian. Anyone interested in coding that?
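
A minimal sketch of what a partial-sort nanmedian could look like in the 1d case, assuming a NumPy recent enough to provide np.partition. The name nanmedian_1d is hypothetical and this is not code from nanny; it only illustrates that a full sort is not needed to find the middle element(s).

```python
import numpy as np

def nanmedian_1d(arr):
    """Median of a 1d array, ignoring NaNs, using a partial sort.

    Hypothetical helper for illustration only.
    """
    a = np.asarray(arr, dtype=np.float64)
    a = a[~np.isnan(a)]              # drop NaNs (boolean indexing copies)
    n = a.size
    if n == 0:
        return np.nan                # assumption: all-NaN input gives NaN
    k = n // 2
    if n % 2:                        # odd count: one middle element
        return np.partition(a, k)[k]
    # even count: average the two middle elements
    part = np.partition(a, [k - 1, k])
    return 0.5 * (part[k - 1] + part[k])
```

np.partition only guarantees that the k-th element is in its sorted position, which is all the median needs.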

dtype: int32, int64, float64 for now
ndim: 1, 2, 3 (need some recursive magic for nd > 3; that's an open
project for anyone)

MOVING WINDOW STATISTICS

I already have doc strings and unit tests
(https://github.com/kwgoodman/la/blob/master/la/farray/mov.py). And I
have a cython prototype that moves the window backwards so that the
stats can be filled in place. (This assumes we make a copy of the data
at the top of the function: arr = arr.astype(float))

Proposed function signature: mov_sum(arr, window, axis=-1),
mov_nansum(arr, window, axis=-1)
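
To make the proposed signature concrete, here is a pure-NumPy reference sketch (not the cython prototype, and not the code in la). It assumes NumPy >= 1.20 for sliding_window_view, and it assumes that the first window-1 positions along the axis are filled with NaN because no full window is available there. The helper _mov_apply is a hypothetical name.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def _mov_apply(func, arr, window, axis):
    """Apply a reducing function over each full window along axis."""
    a = np.asarray(arr, dtype=np.float64)
    n = a.shape[axis]
    if not 1 <= window <= n:
        raise ValueError("window must be between 1 and arr.shape[axis]")
    # The window dimension is appended as the last axis of the view.
    windows = sliding_window_view(a, window, axis=axis)
    valid = func(windows, axis=-1)
    # Assumption: positions without a full window are NaN.
    out = np.full(a.shape, np.nan)
    index = [slice(None)] * a.ndim
    index[axis] = slice(window - 1, None)
    out[tuple(index)] = valid
    return out

def mov_sum(arr, window, axis=-1):
    """Moving sum; a NaN in a window makes that window's sum NaN."""
    return _mov_apply(np.sum, arr, window, axis)

def mov_nansum(arr, window, axis=-1):
    """Moving sum that ignores NaNs inside each window."""
    return _mov_apply(np.nansum, arr, window, axis)
```

This recomputes every window from scratch, so it is useful as a correctness check for a faster implementation rather than as a replacement for one.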

If you don't like mov, then: move? roll?

I think requesting a minimum number of non-nan elements in a window or
else returning NaN is clever. But I do like the simple signature
above.
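
For comparison, a 1d sketch of the minimum-count idea. The function name mov_nansum_min and the parameter min_count are hypothetical, chosen only for illustration; sliding_window_view assumes NumPy >= 1.20.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def mov_nansum_min(arr, window, min_count=1):
    """1d moving nansum; windows with fewer than min_count
    non-NaN elements are set to NaN (hypothetical parameter)."""
    a = np.asarray(arr, dtype=np.float64)
    windows = sliding_window_view(a, window)      # shape (n-window+1, window)
    count = np.sum(~np.isnan(windows), axis=-1)   # non-NaN elements per window
    total = np.nansum(windows, axis=-1)
    out = np.full(a.shape, np.nan)                # leading positions stay NaN
    out[window - 1:] = np.where(count >= min_count, total, np.nan)
    return out
```

The extra argument costs one more keyword in every signature, which is the trade-off against the simpler form.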

Binary moving window functions: mov_nancorr(arr1, arr2, window, axis=-1), etc.

Optional: moving window bootstrap estimate of error (std) of the
moving statistic. So, what's the std of each estimate in the
mov_median output? Too specialized?

dtype: float64
ndim: 1, 2, 3, recursive for nd > 3

NORMALIZATION

I already have nd versions of ranking, zscore, quantile, demean,
demedian, etc. in larry. We should rename them to nanzscore, etc.
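
As an illustration of the NaN-aware naming, a minimal nanzscore sketch built on np.nanmean and np.nanstd. This is not larry's implementation; the signature, including the ddof=0 default, is an assumption.

```python
import numpy as np

def nanzscore(arr, axis=-1, ddof=0):
    """Z-score along axis, ignoring NaNs when computing mean and std.

    NaNs in the input stay NaN in the output. A slice with zero
    spread divides by zero and is not handled specially here.
    """
    a = np.asarray(arr, dtype=np.float64)
    mean = np.nanmean(a, axis=axis, keepdims=True)
    std = np.nanstd(a, axis=axis, ddof=ddof, keepdims=True)
    return (a - mean) / std
```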

ranking and quantile could use some cython love.

I don't know, should we cut this category?

GROUP FUNCTIONS

Input: array, sequence of labels such as a list, axis.

For an array of shape (n,m), axis=0, and a list of n labels with d
distinct values, group_nanmean would return a (d,m) array. I'd also
like a groupfilter_nanmean which would return a (n,m) array and would
have an additional, optional input: exclude_self=False.
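
A rough pure-NumPy sketch of the two group functions. Several things here are assumptions rather than settled design: output rows of group_nanmean follow the sorted order of the distinct labels, a group with no non-NaN values gives NaN, and exclude_self=True is read as "leave the element itself out of its own group mean".

```python
import numpy as np

def group_nanmean(arr, labels, axis=0):
    """Mean of each label group along axis, ignoring NaNs.

    For arr of shape (n, m), axis=0, and d distinct labels the
    result has shape (d, m), ordered by sorted distinct label.
    """
    a = np.moveaxis(np.asarray(arr, dtype=np.float64), axis, 0)
    uniq, inverse = np.unique(np.asarray(labels), return_inverse=True)
    if inverse.size != a.shape[0]:
        raise ValueError("need one label per element along axis")
    isnum = ~np.isnan(a)
    vals = np.where(isnum, a, 0.0)           # NaNs contribute 0 to sums
    out = np.empty((uniq.size,) + a.shape[1:])
    with np.errstate(invalid="ignore", divide="ignore"):
        for i in range(uniq.size):
            rows = inverse == i
            out[i] = vals[rows].sum(axis=0) / isnum[rows].sum(axis=0)
    return np.moveaxis(out, 0, axis)

def groupfilter_nanmean(arr, labels, axis=0, exclude_self=False):
    """Replace each element by the nanmean of its label group.

    Output has the same shape as arr. With exclude_self=True the
    element itself is left out of the mean (assumed meaning).
    """
    a = np.moveaxis(np.asarray(arr, dtype=np.float64), axis, 0)
    _, inverse = np.unique(np.asarray(labels), return_inverse=True)
    if inverse.size != a.shape[0]:
        raise ValueError("need one label per element along axis")
    isnum = ~np.isnan(a)
    vals = np.where(isnum, a, 0.0)
    count = isnum.astype(np.int64)
    out = np.empty(a.shape)
    with np.errstate(invalid="ignore", divide="ignore"):
        for i in range(inverse.max() + 1):
            rows = inverse == i
            num = vals[rows].sum(axis=0)     # group totals, shape a.shape[1:]
            den = count[rows].sum(axis=0)
            if exclude_self:
                num = num - vals[rows]       # broadcasts to one row per member
                den = den - count[rows]
            out[rows] = num / den            # 0/0 gives NaN
    return np.moveaxis(out, 0, axis)
```

Both functions loop over groups in Python, so they are meant as a reference for the semantics, not for speed.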

NAME

What should we call the package?

Numa, numerical analysis with numpy arrays
Dana, data analysis with numpy arrays

import dana as da     (da=data analysis)

ARE YOU CRAZY?

If you read this far, you are crazy and would be a good fit for this project.