[Numpy-discussion] new incremental statistics project

Bradford Cross bradford.n.cross at gmail.com
Thu Dec 25 06:51:57 EST 2008


I did not know about this - very cool!  I think I was asking around the
numpy/scipy lists a while back but nobody mentioned this; is it new?

A couple of questions inline below.

On Fri, Dec 19, 2008 at 2:53 PM, John Hunter <jdh2358 at gmail.com> wrote:

> On Thu, Dec 18, 2008 at 8:27 PM, Bradford Cross
> <bradford.n.cross at gmail.com> wrote:
> > This is a new project I just released.
> >
> > I know it is C#, but some of the design and idioms would be nice in
> > numpy/scipy for working with discrete event simulators, time series, and
> > event stream processing.
> >
> > http://code.google.com/p/incremental-statistics/
>
> I think an incremental stats module would be a boon to numpy or scipy.
>  Eric Firing has a nice module written in C with a pyrex wrapper
> (ringbuf)


Please excuse my ignorance - what is the performance overhead of calling C
via the pyrex wrapper?  A lot of use cases for incremental statistics are
discrete event systems where the calculations will be updated millions or
billions of times; this was a concern I had about doing the project in C and
calling across a wrapper.  Maybe it was one of those entirely speculative
and unfounded concerns. :-)



> that does trailing incremental mean, median, std, min, max,
> and percentile.  It maintains a sorted queue to do the last three
> efficiently, and handles NaN inputs.


Not sure if our results hold universally or even asymptotically, but we found
that our implementation of order/rank statistics was faster when we backed it
with partition selection algorithms operating on an array-based queue as
opposed to our implementation of a sorted deque backed by a circular
buffer.
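
For illustration, here is a minimal Python sketch of the partition-selection
approach: keep the window in a plain queue and run quickselect on demand,
rather than keeping the window sorted. The names (quickselect,
RollingQuantile) are hypothetical, and this is not the library's C# code.

```python
import random
from collections import deque


def quickselect(values, k):
    """Return the k-th smallest value (0-based) without sorting everything."""
    items = list(values)
    if not 0 <= k < len(items):
        raise IndexError("k out of range")
    while True:
        pivot = random.choice(items)
        lower = [v for v in items if v < pivot]
        equal_count = sum(1 for v in items if v == pivot)
        if k < len(lower):
            items = lower                      # answer is below the pivot
        elif k < len(lower) + equal_count:
            return pivot                       # answer is the pivot itself
        else:
            k -= len(lower) + equal_count      # answer is above the pivot
            items = [v for v in items if v > pivot]


class RollingQuantile:
    """Rank statistics over the last `window` observations."""

    def __init__(self, window):
        self.values = deque(maxlen=window)     # oldest value drops off itself

    def update(self, x):
        self.values.append(x)

    def quantile(self, q):
        # Nearest-rank-below convention; other conventions are possible.
        k = int(q * (len(self.values) - 1))
        return quickselect(self.values, k)
```

Updates are O(1) and each query is O(n) on average, so this trades cheaper
updates for more expensive queries compared with a sorted structure.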

How does it handle NaN inputs exactly - does it just guard against them?
That is the approach we took as well.  We have a calculation guard that
filters for both NaN and infinite values.
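
As a rough sketch of what such a guard can look like in Python (the class
name GuardedMean is made up for this example, and skipping bad values
silently is an assumption about the desired policy):

```python
import math


class GuardedMean:
    """Accumulating mean that ignores NaN and infinite observations."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, x):
        # The guard: math.isfinite is False for NaN, +inf and -inf.
        if not math.isfinite(x):
            return False
        self.count += 1
        self.mean += (x - self.mean) / self.count
        return True
```

The return value tells the caller whether the observation was accepted, so
rejected inputs can still be counted or logged if that matters.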



> I would like to see this
> extended to include exponential or other weightings to do things like
> incremental trailing exponential moving averages and variances.


This is a cool idea that I hadn't thought of.  We do have exponentially
weighted mean, but ideally one could supply a weighting function to any
statistic.  We've been moving toward a more functional combinator style
library design lately and this is another step in that direction.
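
To make the two ideas concrete, here is a small Python sketch: an
exponentially weighted mean updated one observation at a time, and a mean
that takes an arbitrary weighting function of an observation's age. The
names and the "age" convention (0 = newest) are assumptions for this
example, not the library's API.

```python
class ExponentialMean:
    """Exponentially weighted mean with smoothing factor alpha in (0, 1]."""

    def __init__(self, alpha):
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.mean = None

    def update(self, x):
        if self.mean is None:
            self.mean = x                      # seed with the first value
        else:
            self.mean += self.alpha * (x - self.mean)
        return self.mean


def weighted_mean(window, weight):
    """Mean of `window` (oldest first), weighting each value by its age."""
    n = len(window)
    weights = [weight(n - 1 - i) for i in range(n)]
    return sum(w * x for w, x in zip(weights, window)) / sum(weights)
```

For example, weighted_mean(values, lambda age: 0.5 ** age) gives geometric
decay, while lambda age: 1.0 recovers the ordinary mean.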


> I
> don't know what the licensing terms are of this module, but it might
> be a good starting point for an incremental numpy stats module, at
> least if you were thinking about supporting a finite lookback window.


Yes, it sounds great!  If you read the docs here:
http://code.google.com/p/incremental-statistics/  you can see that we have
taken care to build the library from the beginning for static, accumulating,
and rolling cases.  The rolling case is what you are referring to as a finite
lookback window, whereas the accumulating case is an accumulating lookback
window, and the static case is the typical "compute the mean of the entire
series of observations at once" case.  IMO, it turns out really nice when you
think this way from the beginning because you get a lot of code reuse and nice
opportunities for composition.
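
A minimal Python sketch of how the three cases can share one core, using
the mean as the statistic. All names here are hypothetical and chosen for
the example; the actual library is C# and may be organized differently.

```python
from collections import deque


class MeanState:
    """Shared core: supports adding and removing single observations."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def add(self, x):
        self.count += 1
        self.total += x

    def remove(self, x):
        self.count -= 1
        self.total -= x

    @property
    def value(self):
        return self.total / self.count


def static_mean(values):
    """Static case: fold the whole series through the core at once."""
    state = MeanState()
    for x in values:
        state.add(x)
    return state.value


class AccumulatingMean:
    """Accumulating case: the lookback window only ever grows."""

    def __init__(self):
        self.state = MeanState()

    def update(self, x):
        self.state.add(x)
        return self.state.value


class RollingMean:
    """Rolling case: finite lookback window of `window` observations."""

    def __init__(self, window):
        self.window = window
        self.buffer = deque()
        self.state = MeanState()

    def update(self, x):
        if len(self.buffer) == self.window:
            self.state.remove(self.buffer.popleft())   # evict the oldest
        self.buffer.append(x)
        self.state.add(x)
        return self.state.value
```

Only the rolling case needs remove(); the other two reuse add() unchanged,
which is where the code reuse comes from. For long-running streams, a
running-sum core like this can accumulate floating-point error.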


>
> We have a copy of this in the py4science examples dir if you want to
> take a look:
>
>    svn co https://matplotlib.svn.sourceforge.net/svnroot/matplotlib/trunk/py4science/examples/pyrex/trailstats
>    cd trailstats/
>    make
>    python movavg_ringbuf.py
>
> Other things that would be very useful are incremental covariance and
> regression.


Indeed.  We have a bit on the dependence statistics side, but not much.
Incremental dependence and regression are the two hot items on the backlog.
:-)
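
For reference, one common way to do the accumulating covariance in a single
pass is a Welford-style co-moment update. The Python sketch below shows
that technique under hypothetical names; it is not code from either
project.

```python
class IncrementalCovariance:
    """Accumulating sample covariance of paired observations (x, y)."""

    def __init__(self):
        self.count = 0
        self.mean_x = 0.0
        self.mean_y = 0.0
        self.comoment = 0.0    # running sum of (x - mean_x) * (y - mean_y)

    def update(self, x, y):
        self.count += 1
        dx = x - self.mean_x                   # uses the OLD mean of x
        self.mean_x += dx / self.count
        self.mean_y += (y - self.mean_y) / self.count
        self.comoment += dx * (y - self.mean_y)  # uses the NEW mean of y

    def covariance(self):
        if self.count < 2:
            raise ValueError("need at least two observations")
        return self.comoment / (self.count - 1)
```

A simple regression slope follows from the same state by also tracking the
variance of x and dividing the covariance by it.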


>
>
> JDH
> _______________________________________________
> Numpy-discussion mailing list
> Numpy-discussion at scipy.org
> http://projects.scipy.org/mailman/listinfo/numpy-discussion
>