PEP 450 Adding a statistics module to Python

Oscar Benjamin oscar.j.benjamin at gmail.com
Fri Aug 16 13:15:52 EDT 2013


On 16 August 2013 17:31,  <chris.barker at noaa.gov> wrote:
>> > I am seeking comments on PEP 450, Adding a statistics module to
>> > Python's standard library.
>
> The trick here is that numpy really is the "right" way to do this stuff.

Although the PEP doesn't mention it, a significant point worth
bearing in mind is that numpy is only available for CPython, not for
PyPy, IronPython, Jython etc. See here for a recent update on the
status of NumPyPy:
http://morepypy.blogspot.co.uk/2013_08_01_archive.html

> I like to say:
> "crunching numbers in python without numpy is like doing text processing without using the string object"

It depends what kind of number crunching you're doing. Numpy gives
efficient C-style number crunching, but it doesn't really give
efficient ways to take advantage of the areas where Python is better
than C, such as arbitrary-precision integers and the decimal and
rational arithmetic in the standard library. You can use dtype=object
to put all of these things in numpy arrays, but in my experience that
is typically no faster than working with Python lists and is only
really useful when you want numpy's multi-dimensional, view-based
slicing.
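
For illustration, a dtype=object array really does hold the Python
objects themselves, and arithmetic stays exact (a quick sketch; the
exact repr formatting varies between numpy versions):

    >>> import numpy
    >>> from fractions import Fraction as F
    >>> a = numpy.array([F(1, 3), F(2, 3)], dtype=object)
    >>> a + a  # elementwise Python addition on the Fraction objects
    array([Fraction(2, 3), Fraction(4, 3)], dtype=object)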

Here's an example where Steven's statistics module is more accurate:

    >>> numpy.mean([-1e60, 100, 100, 1e60])
    0.0
    >>> statistics.mean([-1e60, 100, 100, 1e60])
    50.0
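
What happens there is that numpy accumulates a float sum from left to
right, so the two 100s are lost to rounding before the huge terms
cancel. Steven's module avoids this (I believe the reference
implementation does its summation with exact rational arithmetic);
you can see the same principle with math.fsum, shown here purely as
an illustration rather than as the PEP's actual code:

    >>> import math
    >>> sum([-1e60, 100, 100, 1e60]) / 4   # naive left-to-right sum
    0.0
    >>> math.fsum([-1e60, 100, 100, 1e60]) / 4   # exactly rounded sum
    50.0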

Okay, so that's a toy example, but it illustrates that Steven is
aiming for ultra-high accuracy, whereas numpy is primarily aimed at
speed. He's also tried to ensure that the module works properly with
e.g. fractions:

    >>> from fractions import Fraction as F
    >>> data = [F('1/7'), F('3/7')]
    >>> numpy.mean(data)
    0.2857142857142857
    >>> statistics.mean(data)
    Fraction(2, 7)

and decimals:

    >>> from decimal import Decimal as D
    >>> data = [D('0.1'), D('0.01'), D('0.001')]
    >>> numpy.mean(data)
    Traceback (most recent call last):
      ...
    TypeError: unsupported operand type(s) for /: 'decimal.Decimal' and 'float'
    >>> statistics.mean(data)
    Decimal('0.037')

> What this is really an argument for is a numpy-lite in the standard library, which could be used to build these sorts of things on. But that's been rejected before...

If it's a numpy-lite then it's a numpy-ultra-lite: it really doesn't
cover much of what numpy does. I would describe it as a Pythonic
implementation of elementary statistical computation rather than as a
numpy-lite.
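
To give an idea of the scope, the PEP proposes functions such as
mean, median, mode, variance and stdev. A session with some made-up
example data looks something like this:

    >>> import statistics
    >>> data = [2.5, 3.25, 5.5, 11.25]
    >>> statistics.median(data)
    4.375
    >>> statistics.variance(data)   # sample variance
    15.6875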

[snip]
>
> All that being said -- if you do decide to do this, please use a PEP 3118 (enhanced buffer) supporting data type (probably array.array) -- compatibility with numpy and other packages for crunching numbers is very nice.
>
> If someone decides to build a stand-alone stats package -- building it on a ndarray-lite (PEP 3118 compatible) object would be a nice way to go.

Why? Yes, I'd also like an ndarray-lite, or rather an ultra-lite
1-dimensional version, but why would it serve the statistics module
better than standard Python containers? Note that numpy arrays
already work with the reference implementation of the statistics
module (they're just treated as iterables):

    >>> import numpy
    >>> import statistics
    >>> statistics.mean(numpy.array([1, 2, 3]))
    2.0
    >>> statistics.mean(numpy.array([[1, 2, 3], [4, 5, 6]]))
    array([ 2.5,  3.5,  4.5])

> One other point -- for performance reason, is would be nice to have some compiled code in there -- this adds incentive to put it in the stdlib -- external packages that need compiling is what makes numpy unacceptable to some folks.

It might be good to have a C accelerator one day, but I actually
think the pure-Python-ness of the module is a strong reason to have
it, since it provides accurate statistics functions to all Python
implementations (unlike numpy) at no additional cost.


Oscar


