[Python-ideas] Proposal: add a calculator statistics module

Tue Sep 13 11:23:35 CEST 2011

On 13 September 2011 05:06, Nick Coghlan <ncoghlan at gmail.com> wrote:
> On Tue, Sep 13, 2011 at 11:00 AM, Steven D'Aprano <steve at pearwood.info> wrote:
>> I propose adding a basic calculator statistics module to the standard
>> library, similar to the sorts of functions you would get on a scientific
>> calculator:
>>
>> mean (average)
>> variance (population and sample)
>> standard deviation (population and sample)
>> correlation coefficient
>>
>> and similar. I am volunteering to provide, and support, this module, written
>> in pure Python so other implementations will be able to use it.
>>
>> Simple calculator-style statistics seem to me to be a fairly obvious
>> "battery" to be included, more useful in practice than some functions
>> already available such as factorial and the hyperbolic functions.
>
> And since some folks may not have seen it, Steven's proposal here is
> following up on a suggestion Raymond Hettinger posted to this list last
> year:
>
> http://mail.python.org/pipermail/python-ideas/2010-October/008267.html
>
> From my point of view, I'd make the following suggestions:
>
> 1. We should start very small (similar to the way itertools grew over time)
>
> To me that means:
>  mean, median, mode
>  variance
>  standard deviation
>
> Anything beyond that (including coroutine-style running calculations)
> is probably better left until 3.4. In the specific case of running
> calculations, this is to give us a chance to see how coroutine APIs
> are best written in a world where generators can return values as well
> as yielding them. Any APIs that would benefit from having access to
> running variants (such as being able to collect multiple statistics in
> a single pass) should also be postponed.
>
> Some more advanced algorithms could be included as recipes in the
> initial docs. The docs should also include pointers to more
> full-featured stats modules for reference when users' needs outgrow the
> included batteries.
>
> 2. The 'math' module is not the place for this; a new, dedicated
> module is more appropriate. This is mainly due to the fact that the
> math module is focused primarily on binary floating point, while these
> algorithms should be neutral with regard to the specific numeric type
> involved. However, the practical issues with math being a builtin
> module are also a factor.
>
> There are many colours the naming bikeshed could be painted, but I'd
> be inclined to just call it 'statistics' ('statstools' is unwieldy,
> and other variants like 'stats', 'simplestats', 'statlib' and
> 'stats-tools' all exist on PyPI). Since the opportunity to just use
> the full word is there, we may as well take it.

+1 (both on Steven's original suggestion, and Nick's follow-up comment).

I like the suggestion of having a running calculation version, but
agree that it's probably a bit soon to decide on the best API for such
things. Recipes in the documentation would be a good start, though.
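
As an illustration of the kind of recipe meant here, a single-pass
running mean and variance can be sketched with Welford's update. This
is only a sketch under assumptions: the class name RunningStats and its
methods are hypothetical, not a proposed API, and it works in floats
rather than being neutral about the numeric type.

```python
import math


class RunningStats:
    """Hypothetical helper: running mean/variance via Welford's update."""

    def __init__(self):
        self.n = 0         # number of values seen so far
        self.mean = 0.0    # running mean
        self._m2 = 0.0     # running sum of squared deviations from the mean

    def add(self, x):
        """Fold one more data point into the running totals."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        # Uses the deviation from the old mean times the deviation
        # from the updated mean.
        self._m2 += delta * (x - self.mean)

    def variance(self):
        """Sample variance (n - 1 in the denominator)."""
        if self.n < 2:
            raise ValueError("need at least two data points")
        return self._m2 / (self.n - 1)

    def stdev(self):
        """Sample standard deviation."""
        return math.sqrt(self.variance())
```

Values are fed in one at a time with add(), and mean, variance() and
stdev() can be read at any point, so several statistics come out of one
pass over the data.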

One place I'd disagree with Nick, though, I'd like to see correlation
coefficient and linear regression in there. They are common on
calculators, and I do tend to use them reasonably often. Please save
me from starting Excel to calculate them! :-)
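
To make the request concrete, the two calculator functions could look
roughly like the sketch below. The names pearson_r and linear_fit are
made up for illustration and are not part of the proposal; the code
uses the plain two-pass textbook formulas and makes no attempt at the
numerical care a real module would need.

```python
import math


def _centered_sums(xs, ys):
    """Return the means and the centered sums Sxx, Syy, Sxy."""
    n = len(xs)
    if n != len(ys):
        raise ValueError("xs and ys must have the same length")
    if n < 2:
        raise ValueError("need at least two data points")
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return mx, my, sxx, syy, sxy


def pearson_r(xs, ys):
    """Pearson correlation coefficient (hypothetical name)."""
    _, _, sxx, syy, sxy = _centered_sums(xs, ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant data")
    return sxy / math.sqrt(sxx * syy)


def linear_fit(xs, ys):
    """Least-squares line y = slope * x + intercept (hypothetical name)."""
    mx, my, sxx, _, sxy = _centered_sums(xs, ys)
    if sxx == 0:
        raise ValueError("slope is undefined when all x values are equal")
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept
```

Both functions take two equal-length sequences, for example
pearson_r(xs, ys) and slope, intercept = linear_fit(xs, ys).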

Paul.