[Python-ideas] Proposal: add a calculator statistics module

Tue Sep 13 06:06:45 CEST 2011

On Tue, Sep 13, 2011 at 11:00 AM, Steven D'Aprano <steve at pearwood.info> wrote:
> I propose adding a basic calculator statistics module to the standard
> library, similar to the sorts of functions you would get on a scientific
> calculator:
>
> mean (average)
> variance (population and sample)
> standard deviation (population and sample)
> correlation coefficient
>
> and similar. I am volunteering to provide, and support, this module, written
> in pure Python so other implementations will be able to use it.
>
> Simple calculator-style statistics seem to me to be a fairly obvious
> "battery" to be included, more useful in practice than some functions
> already available such as factorial and the hyperbolic functions.

And since some folks may not have seen it, Steven's proposal here is
following up on a suggestion Raymond Hettinger posted to this list last
year:

http://mail.python.org/pipermail/python-ideas/2010-October/008267.html

From my point of view, I'd make the following suggestions:

1. We should start very small (similar to the way itertools grew over time)

To me that means:
  mean, median, mode
  variance
  standard deviation
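
To give a feel for how small that starting set is, here is a minimal
pure-Python sketch. The function names, the 'sample' keyword and the
error handling are illustrative assumptions, not a proposed API, and the
naive two-pass variance is chosen for readability rather than numerical
robustness.

```python
from collections import Counter
from math import sqrt


def mean(data):
    """Arithmetic mean of a non-empty iterable of numbers."""
    data = list(data)
    if not data:
        raise ValueError("mean requires at least one data point")
    return sum(data) / len(data)


def median(data):
    """Middle value; the average of the two middle values if n is even."""
    data = sorted(data)
    n = len(data)
    if n == 0:
        raise ValueError("median requires at least one data point")
    mid = n // 2
    if n % 2:
        return data[mid]
    return (data[mid - 1] + data[mid]) / 2


def mode(data):
    """Most common value (ties are resolved arbitrarily in this sketch)."""
    counts = Counter(data)
    if not counts:
        raise ValueError("mode requires at least one data point")
    return counts.most_common(1)[0][0]


def variance(data, sample=True):
    """Two-pass variance; sample=True divides by n - 1, otherwise by n."""
    data = list(data)
    n = len(data)
    if n < (2 if sample else 1):
        raise ValueError("not enough data points")
    m = sum(data) / n
    squares = sum((x - m) ** 2 for x in data)
    return squares / (n - 1 if sample else n)


def stdev(data, sample=True):
    """Standard deviation; math.sqrt means the result is always a float."""
    return sqrt(variance(data, sample))
```

For example, variance([2, 4, 4, 4, 5, 5, 7, 9], sample=False) gives 4.0
and the matching stdev gives 2.0.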

Anything beyond that (including coroutine-style running calculations)
is probably better left until 3.4. In the specific case of running
calculations, this is to give us a chance to see how coroutine APIs
are best written in a world where generators can return values as well
as yielding them. Any APIs that would benefit from having access to
running variants (such as being able to collect multiple statistics in
a single pass) should also be postponed.
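
As a rough idea of what a coroutine-style running calculation might look
like once generators can return values, the sketch below collects the
count, mean and sample variance in a single pass. It uses Welford's
online update, which is one common choice and not something this thread
has settled on. The names running_stats and collect, and the convention
of sending None to finish, are assumptions made purely for illustration.

```python
def running_stats():
    """Coroutine: send() values in, then send None to get the summary."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    while True:
        x = yield mean  # hand back the running mean after each value
        if x is None:
            break
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    variance = m2 / (n - 1) if n > 1 else None
    return n, mean, variance  # delivered to the caller via StopIteration


def collect(data):
    """Drive the coroutine over data and unpack its return value."""
    stats = running_stats()
    next(stats)  # prime the coroutine up to its first yield
    for x in data:
        stats.send(x)
    try:
        stats.send(None)
    except StopIteration as stop:
        return stop.value
```

Calling collect([2, 4, 4, 4, 5, 5, 7, 9]) returns a (count, mean,
sample variance) tuple without storing the data. Having to catch
StopIteration by hand is exactly the kind of API wart that is worth
letting settle before committing to a design.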

Some more advanced algorithms could be included as recipes in the
initial docs. The docs should also include pointers to more
full-featured stats modules for reference when users' needs outgrow the
included batteries.

2. The 'math' module is not the place for this; a new, dedicated
module is more appropriate. This is mainly due to the fact that the
math module is focused primarily on binary floating point, while these
algorithms should be neutral with regard to the specific numeric type
involved. However, the practical issues with math being a builtin
module are also a factor.
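
To illustrate the type-neutrality point, the sketch below compares a
mean built only from sum() and division with one routed through
math.fsum, which converts its inputs to binary floats. The helper name
generic_mean is hypothetical.

```python
import math
from decimal import Decimal
from fractions import Fraction


def generic_mean(data):
    """Mean using only + and /, so the input's numeric type is kept."""
    data = list(data)
    return sum(data) / len(data)


# Fractions stay exact and the result is still a Fraction.
frac_result = generic_mean([Fraction(1, 3), Fraction(2, 3)])

# Decimals stay Decimals, with decimal rather than binary rounding.
dec_result = generic_mean([Decimal("0.1"), Decimal("0.2"), Decimal("0.3")])

# math.fsum coerces every input to float, so the original type is lost.
float_result = math.fsum([Fraction(1, 3), Fraction(2, 3)]) / 2
```

Here frac_result is Fraction(1, 2) and dec_result is Decimal('0.2'),
while float_result is a plain float regardless of what went in.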

There are many colours the naming bikeshed could be painted, but I'd
be inclined to just call it 'statistics' ('statstools' is unwieldy,
and other variants like 'stats', 'simplestats', 'statlib' and
'stats-tools' all exist on PyPI). Since the opportunity to just use
the full word is there, we may as well take it.

Regards,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia