[Numpy-discussion] Tools / data structures for statistical analysis and related applications

Wed Jun 9 16:40:46 EDT 2010

Dear all,

We've been having discussions on the pystatsmodels mailing list
recently regarding data structures and other tools for statistics and
related data analysis applications.  I believe we're trying to answer
a number of different but related questions:

1. What are the sets of functionality (and use cases) which would be
desirable for the scientific (or statistical) Python programmer?
Things like groupby
(http://projects.scipy.org/numpy/browser/trunk/doc/neps/groupby_additions.rst)
fall into this category.

2. Do we really need to build custom data structures (larry, pandas,
tabular, etc.) or are structured ndarrays enough? (My conclusion is
that we do need to, but others might disagree). If so, how much
performance are we willing to trade for functionality?

3. What needs to happen for Python / NumPy / SciPy to really "break
in" to the statistical computing field? In other words, could a
Python-based stack one day be a competitive alternative to R?
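
As a rough illustration of questions 1 and 2, here is a minimal sketch
of a groupby-style mean on a plain structured ndarray. It is not taken
from larry, pandas, or tabular; the field names ("key", "value") and
the helper group_means are made up for this example.

```python
import numpy as np

# Hypothetical records: a group label and a numeric value per row.
data = np.array(
    [("a", 1.0), ("b", 2.0), ("a", 3.0), ("b", 4.0), ("c", 5.0)],
    dtype=[("key", "U1"), ("value", "f8")],
)


def group_means(records, key_field, value_field):
    """Return a dict mapping each distinct key to the mean of its values."""
    # np.unique gives the sorted distinct keys plus, for every row,
    # the index of that row's key within the distinct keys.
    keys, inverse = np.unique(records[key_field], return_inverse=True)
    # Sum the values per group and count the rows per group.
    sums = np.bincount(inverse, weights=records[value_field])
    counts = np.bincount(inverse)
    return {str(k): float(s / c) for k, s, c in zip(keys, sums, counts)}


means = group_means(data, "key", "value")
print(means)
```

This works with NumPy alone, but the label bookkeeping is manual, and
that is roughly the kind of thing a custom data structure would be
expected to handle for the user.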

These are just some ideas for collecting community input. Of course,
since we're all working in different problem domains, users' needs
will vary quite a bit. We've started to collect some thoughts, links,
etc. on the scipy.org wiki:

http://scipy.org/StatisticalDataStructures

A lot of what's there already is commentary on, and comparison of, the
functionality provided by pandas and la / larry (since Keith and I
wrote most of the stuff there). But I think we're trying to identify
more generally the things that are lacking in NumPy/SciPy and related
libraries for particular applications. At a minimum it should be good
fodder for the SciPy conferences this year and afterward (I am
submitting a paper on this subject based on my experiences).

- Wes