PEP 450 Adding a statistics module to Python

chris.barker at noaa.gov
Fri Aug 16 15:00:18 EDT 2013


On Friday, August 16, 2013 10:15:52 AM UTC-7, Oscar Benjamin wrote:
> On 16 August 2013 17:31,  <chris.barker at noaa.gov> wrote:
> Although it doesn't mention this in the PEP, a significant point that
> is worth bearing in mind is that numpy is only for CPython, not PyPy,
> IronPython, Jython etc. See here for a recent update on the status of

It does mention it, though I think not the additional implementations by name. And yes, the lack of numpy on the other implementations is a major limitation.
 
> > "crunching numbers in python without numpy is like doing text processing without using the string object"
> 
> It depends what kind of number crunching you're doing.

Just as it depends on what kind of text processing you're doing... You could go a long way with a pure-Python sequence-of-abstract-characters library, but it would be painfully slow -- no one would even try.

I guess there are more people working with, say, hundreds of numbers, than people trying to process an equally tiny amount of text...but this is a digression.

My point is that you can only reasonably do string processing in Python because Python has the concept of a string, not just an arbitrary sequence of characters -- and not just for speed's sake, but for the nice semantics.

Anyone who has used an array-oriented language or library is likely to get addicted to the idea of an array of numbers as a first-class concept -- it is really, really helpful, for both performance and semantics.

> Numpy gives efficient C-style number crunching

which is by far the most common case. Also, a properly designed algorithm may well need to know something about the internal storage/processing of the data type -- i.e. the best way to compute a given statistic for floating point may not be the same as for integers (or decimal, or...). Maybe you can get a good one that works for most, but...
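To make that concrete with a quick sketch of my own (the data here is made up, purely for illustration): with Python's arbitrary-precision ints the naive sum is already exact, while the same approach on floats silently loses the small terms -- so the "right" algorithm really does depend on the type.

```python
from fractions import Fraction

# Integers: Python ints never overflow or round, so plain sum() is exact,
# and the mean can be kept exact as a rational number.
ints = [10**30, 1, 1]
exact_mean = Fraction(sum(ints), len(ints))   # Fraction(10**30 + 2, 3), no rounding

# Floats: the same naive sum swallows the small terms entirely,
# so a float algorithm needs something smarter (e.g. math.fsum).
floats = [1e30, 1.0, 1.0]
naive_mean = sum(floats) / len(floats)        # the two 1.0s vanish into 1e30
```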


> You can use  dtype=object to use all these things with numpy arrays but in my 
> experience this is typically not faster than working with Python lists

That's quite true. In fact, often slower.

> and is only really useful when you want numpy's multi-dimensional, 
> view-type slicing.

which is very useful indeed!

> Here's an example where Steven's statistics module is more accurate:

>     >>> numpy.mean([-1e60, 100, 100, 1e60])
>     0.0
>     >>> statistics.mean([-1e60, 100, 100, 1e60])
>     50.0

The wonders of floating point arithmetic! But this looks like more of an argument for a better algorithm in numpy than a reason to have something in the stdlib -- in fact, that's been discussed lately; there is talk of using compensated summation in the numpy sum() method -- not sure of the status.
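For what it's worth, pure Python already ships one such tool: math.fsum tracks the partial sums exactly (Shewchuk's algorithm), which is even stronger than simple compensated summation. A quick sketch using Oscar's data:

```python
import math

data = [-1e60, 100, 100, 1e60]

naive = sum(data) / len(data)         # the 100s are swallowed by 1e60: gives 0.0
better = math.fsum(data) / len(data)  # exact partial sums: gives 50.0
```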

> Okay so that's a toy example but it illustrates that Steven is aiming 
> for ultra-high accuracy where numpy is primarily aimed at speed. 

Well, yes, for the most part numpy does trade accuracy for speed when it has to -- but that's not the case here; I think this is a case of "no one took the time to write a better algorithm".

> He's also tried to ensure that it works properly with e.g. fractions:

That is pretty cool, yes.
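If I understand the reference implementation correctly (this matches the statistics module that eventually shipped), Fractions in give an exact Fraction back -- something numpy's float pipeline can't do:

```python
from fractions import Fraction
import statistics

data = [Fraction(1, 4), Fraction(1, 4), Fraction(1, 2)]
result = statistics.mean(data)   # exact: Fraction(1, 3), no float rounding anywhere
```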

> > What this is really an argument for is a numpy-lite in the standard library, which could be used to build these sorts of things on. But that's been rejected before...
> 
> If it's a numpy-lite then it's a numpy-ultra-lite. It really doesn't
> provide much of what numpy provides.

I wasn't clear -- my point was that things like this should be built on a numpy-like array object (numpy-lite) -- so first adding such an object to the stdlib, then building this on top of it would be nice. But a key problem with that is where do you draw the line that defines numpy-lite? I'd say just the core storage object, but then someone wants to add statistics, and someone else wants to add polynomials, and then random numbers, then ... and pretty soon you've got numpy again!

> > All that being said -- if you do decide to do this, please use a PEP 3118 (enhanced buffer) supporting data type (probably array.array) -- compatibility with numpy and other packages for crunching numbers is very nice.
> 
> > If someone decides to build a stand-alone stats package -- building it on a ndarray-lite (PEP 3118 compatible) object would be a nice way to go.
> 
> Why? Yes I'd also like an ndarray-lite or rather an ultra-lite 
> 1-dimensional version but why would it be useful for the statistics 
> module over using standard Python containers? Note that numpy arrays 
> do work with the reference implementation of the statistics module 
> (they're just treated as iterables):

One of the really great things about numpy is that when you work with a LOT of numbers (which is not rare in this era of Big Data) it stores them efficiently, and you can push them around between different arrays and other libraries without unpacking and copying data. That's what PEP 3118 is all about.

It looks like there is some real care being put into these algorithms, so it would be nice if they could be efficiently used for large data sets and with numpy.
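A tiny stdlib-only sketch of the zero-copy sharing PEP 3118 enables -- memoryview and array.array both speak the buffer protocol, and numpy.frombuffer(data) would wrap this very same memory as an ndarray, no copy:

```python
from array import array

data = array('d', [1.0, 2.0, 3.0, 4.0])

view = memoryview(data)   # a zero-copy view via the buffer protocol (PEP 3118)
view[0] = 99.0            # writes through to the original storage

# data[0] is now 99.0 -- nothing was unpacked or copied along the way.
```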


>     >>> import numpy
>     >>> import statistics 
>     >>> statistics.mean(numpy.array([1, 2, 3]))

You'll probably find that this is slower than a Python list -- numpy has some overhead when used as a generic sequence.

> > One other point -- for performance reasons, it would be nice to have some compiled code in there -- this adds incentive to put it in the stdlib: external packages that need compiling are what makes numpy unacceptable to some folks.
>  
> It might be good to have a C accelerator one day but actually I think 
> the pure-Python-ness of it is a strong reason to have it since it 
> provides accurate statistics functions to all Python implementations 
> (unlike numpy) at no additional cost.

Well, I'd rather not have a package that is great for education and toy problems, but not so good for the real ones...

I guess my point is this:

This is a way to make the standard python distribution better for some common computational tasks. But rather than think of it as "we need some stats functions in the python stdlib", perhaps we should be thinking: "out of the box python should be better for computation" -- in which case, I'd start with a decent array object.

-Chris


More information about the Python-list mailing list