[Python-ideas] Proposal: add a calculator statistics module

Tal Einat taleinat at gmail.com
Sat Oct 1 21:32:43 CEST 2011


On Tue, Sep 13, 2011 at 12:23 PM, Paul Moore <p.f.moore at gmail.com> wrote:

> On 13 September 2011 05:06, Nick Coghlan <ncoghlan at gmail.com> wrote:
> > On Tue, Sep 13, 2011 at 11:00 AM, Steven D'Aprano <steve at pearwood.info> wrote:
> >> I propose adding a basic calculator statistics module to the standard
> >> library, similar to the sorts of functions you would get on a scientific
> >> calculator:
> >>
> >> mean (average)
> >> variance (population and sample)
> >> standard deviation (population and sample)
> >> correlation coefficient
> >>
> >> and similar. I am volunteering to provide, and support, this module,
> >> written in pure Python so other implementations will be able to use it.
> >>
> >> Simple calculator-style statistics seem to me to be a fairly obvious
> >> "battery" to be included, more useful in practice than some functions
> >> already available such as factorial and the hyperbolic functions.
> >
> > And since some folks may not have seen it, Steven's proposal here is
> > following up on a suggestion Raymond Hettinger posted to this list last
> > year:
> >
> > http://mail.python.org/pipermail/python-ideas/2010-October/008267.html
> >
> > From my point of view, I'd make the following suggestions:
> >
> > 1. We should start very small (similar to the way itertools grew over
> > time)
> >
> > To me that means:
> >  mean, median, mode
> >  variance
> >  standard deviation
> >
> > Anything beyond that (including coroutine-style running calculations)
> > is probably better left until 3.4. In the specific case of running
> > calculations, this is to give us a chance to see how coroutine APIs
> > are best written in a world where generators can return values as well
> > as yielding them. Any APIs that would benefit from having access to
> > running variants (such as being able to collect multiple statistics in
> > a single pass) should also be postponed.
> >
> > Some more advanced algorithms could be included as recipes in the
> > initial docs. The docs should also include pointers to more
> > full-featured stats modules for reference when users' needs outgrow the
> > included batteries.
> >
> > 2. The 'math' module is not the place for this, a new, dedicated
> > module is more appropriate. This is mainly due to the fact that the
> > math module is focused primarily on binary floating point, while these
> > algorithms should be neutral with regard to the specific numeric type
> > involved. However, the practical issues with math being a builtin
> > module are also a factor.
> >
> > There are many colours the naming bikeshed could be painted, but I'd
> > be inclined to just call it 'statistics' ('statstools' is unwieldy,
> > and other variants like 'stats', 'simplestats', 'statlib' and
> > 'stats-tools' all exist on PyPI). Since the opportunity to just use
> > the full word is there, we may as well take it.
>
> +1 (both on Steven's original suggestion, and Nick's follow-up
> comment).
>
> I like the suggestion of having a running calculation version, but
> agree that it's probably a bit soon to decide on the best API for such
> things. Recipes in the documentation would be a good start, though.
>

In the past few months I've done some work on "running calculations" in
Python, and came up with a module I call RunningCalcs:
http://pypi.python.org/pypi/RunningCalcs/
http://bitbucket.org/taleinat/runningcalcs/
It includes comprehensive tests and some benchmarks (in the wiki at
BitBucket).

If "running calculations" are to be considered for inclusion in the stdlib,
I propose RunningCalcs as an example implementation. Note that implementing
calculations in this manner makes performing several calculations on a
single iterable very easy and potentially efficient.

RunningCalcs includes implementations of a few calculations, including mean,
variance and standard deviation, min & max, several summation algorithms
and n-largest & n-smallest. Implementing a RunningCalc is simple and
straightforward. Usage is as follows:

# feeding inputs directly to the RunningCalc instances, one input at a time
mean_rc, stddev_rc = RunningMean(), RunningStdDev()
for x in inputs:
    mean_rc.feed(x)
    stddev_rc.feed(x)
mean, stddev = mean_rc.value, stddev_rc.value

# easy & fast calculation using apply_in_parallel()
a_i_p = apply_in_parallel
mean, stddev = a_i_p(inputs, [RunningMean(), RunningStdDev()])
small5, large5 = a_i_p(inputs, [RunningNSmallest(5), RunningNLargest(5)])
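
To make the feed()/value pattern above concrete, here is a minimal sketch of
how such classes could be written. It is not the RunningCalcs source: the
class and function names mirror the usage above, but the bodies are assumed.
In particular, Welford's algorithm and the sample (n - 1) standard deviation
are choices made for this illustration only.

```python
import math


class RunningMean:
    """Hypothetical sketch: arithmetic mean updated one input at a time."""

    def __init__(self):
        self.count = 0
        self.value = None  # no inputs fed yet

    def feed(self, x):
        self.count += 1
        if self.count == 1:
            self.value = x
        else:
            # Incremental update; avoids keeping a large running sum.
            self.value += (x - self.value) / self.count


class RunningStdDev:
    """Hypothetical sketch: sample standard deviation (Welford's algorithm)."""

    def __init__(self):
        self.count = 0
        self._mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def feed(self, x):
        self.count += 1
        delta = x - self._mean
        self._mean += delta / self.count
        self._m2 += delta * (x - self._mean)

    @property
    def value(self):
        # Computed only on request, not on every feed().
        if self.count < 2:
            return None
        return math.sqrt(self._m2 / (self.count - 1))


def apply_in_parallel(inputs, calcs):
    """Assumed behaviour: feed each input to every calc in a single pass."""
    for x in inputs:
        for calc in calcs:
            calc.feed(x)
    return [calc.value for calc in calcs]


# Works on a one-shot iterator, since the inputs are traversed only once.
inputs = iter([2, 4, 4, 4, 5, 5, 7, 9])
mean, stddev = apply_in_parallel(inputs, [RunningMean(), RunningStdDev()])
```

Because each object only exposes feed() and value, several calculations can
share one pass over an iterable that cannot be rewound.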

Regarding co-routines: During development I considered using
co-routine-generators; my implementation of Kahan summation still uses such
a generator. I've found this isn't a good generic method for implementing
"running calculations", mainly because such a generator must return the
current value at each iteration, even though this value is usually not
needed nearly so often. For example, implementing a running version of
n-largest using a co-routine/generator would introduce a large overhead,
whereas my version is as fast as _heapq.nlargest (which is implemented in C
-- see benchmarks for details).
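
The overhead described above can be illustrated with a rough sketch that
contrasts the two styles for n-largest. Both versions below are assumptions
written for this illustration (they are neither the RunningCalcs code nor
heapq.nlargest): the generator has to build a result for every send(), while
the class builds one only when value is read.

```python
import heapq


def nlargest_coroutine(n):
    """Hypothetical generator version: yields the current result on each send()."""
    heap = []
    result = []
    while True:
        x = yield result
        if len(heap) < n:
            heapq.heappush(heap, x)
        elif heap and x > heap[0]:
            heapq.heapreplace(heap, x)
        # Extra work on every input, even if the caller ignores it.
        result = sorted(heap, reverse=True)


class RunningNLargest:
    """Hypothetical class version: the result is built only when requested."""

    def __init__(self, n):
        self.n = n
        self._heap = []  # min-heap holding the n largest inputs seen so far

    def feed(self, x):
        if len(self._heap) < self.n:
            heapq.heappush(self._heap, x)
        elif self._heap and x > self._heap[0]:
            heapq.heapreplace(self._heap, x)

    @property
    def value(self):
        return sorted(self._heap, reverse=True)


data = [5, 1, 9, 3, 7, 8]

gen = nlargest_coroutine(3)
next(gen)  # prime the generator so it is waiting at the first yield
for x in data:
    from_coroutine = gen.send(x)  # a sorted list is produced every time

rc = RunningNLargest(3)
for x in data:
    rc.feed(x)
from_class = rc.value  # sorted once, at the end
```

Both approaches end with the same answer; the difference is how often the
intermediate result is produced along the way.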

- Tal Einat