[Numpy-discussion] new incremental statistics project

David Cournapeau cournape at gmail.com
Thu Jan 22 05:04:34 EST 2009


On Mon, Jan 19, 2009 at 7:34 PM, Hans Meine
<meine at informatik.uni-hamburg.de> wrote:
> On Friday 19 December 2008 03:27:12 Bradford Cross wrote:
>> This is a new project I just released.
>>
>> I know it is C#, but some of the design and idioms would be nice in
>> numpy/scipy for working with discrete event simulators, time series, and
>> event stream processing.
>>
>> http://code.google.com/p/incremental-statistics/
>
> Hi, do you know about the boost accumulators project?
>
> It's still in boost's sandbox, but I love its design, and it provides a large
> number of well-documented, mathematically sound estimators for variance, mean,
> etc.:
> http://boost-sandbox.sourceforge.net/libs/accumulators/doc/html/index.html
>
> Just a heads-up, in case someone finds this useful here.
> (Don't know about people's fondness of boost and/or C++ here.)

Not a boost/C++ fan, but I like those projects. Incremental statistics
have several advantages besides the obvious one of giving an online
estimate when the data arrive sequentially: they can be much more
memory friendly in a python context (for example, if you want to
compute statistics for billions of samples, you could do it in
mini-batches, and an incremental framework can help here), and they
can often converge faster than an offline version even when you have
all the data.
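
To make the mini-batch point concrete, here is a minimal sketch, not
taken from either project mentioned above. It keeps a running (count,
mean, sum of squared deviations) summary and folds in one batch at a
time using the standard pairwise-merge update for mean and variance.
The names RunningStats and batches are hypothetical, and the code is
pure Python so that it runs without any dependency; only the current
batch is ever held in memory.

```python
import math


class RunningStats:
    """Running count, mean and variance, updated one mini-batch at a time."""

    def __init__(self):
        self.n = 0        # number of samples seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def update(self, batch):
        """Fold a batch (any iterable of numbers) into the running state."""
        batch = list(batch)
        nb = len(batch)
        if nb == 0:
            return
        # Summary of the new batch alone.
        mean_b = math.fsum(batch) / nb
        m2_b = math.fsum((x - mean_b) ** 2 for x in batch)
        # Merge the two (n, mean, m2) summaries.
        delta = mean_b - self.mean
        total = self.n + nb
        self.mean += delta * nb / total
        self.m2 += m2_b + delta * delta * self.n * nb / total
        self.n = total

    @property
    def variance(self):
        """Unbiased sample variance; NaN with fewer than two samples."""
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")


def batches(n_batches, batch_size):
    """Hypothetical data source (assumption): yields one batch at a time."""
    for i in range(n_batches):
        start = i * batch_size
        yield [float(x) for x in range(start, start + batch_size)]


stats = RunningStats()
for batch in batches(n_batches=10, batch_size=100):
    stats.update(batch)
```

In a real setting the batches generator would be replaced by whatever
reads chunks from disk or from a stream; the state kept between
batches is just three numbers.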

It is not yet clear to me how widespread those techniques are. I have
looked at several papers which prove the convergence of several
well-known algorithms, and implemented some of them (in particular an
online EM algorithm for estimating mixtures of Gaussians, with
Bayesian variations for sequential model comparison), and I would have
expected them to be better known. I may just not be that familiar with
the relevant fields, though.

cheers,

David


