PEP 450 Adding a statistics module to Python

chris.barker at noaa.gov
Fri Aug 16 15:48:16 EDT 2013


On Friday, August 16, 2013 11:51:49 AM UTC-7, Steven D'Aprano wrote:
> > The trick here is that numpy really is the "right" way to do this stuff.

> Numpy does not have a monopoly on the correct algorithms for statistics  
> functions,

indeed not -- in fact, a number of them are quite lame, either because of chosen speed vs. accuracy trade-offs, or just plain no-one-got-around-to-writing-the-code.

I kind of misspoke: what I meant was "a numpy ndarray-like object is the 'right' way to do this", not numpy itself.

> and a big, heavyweight library like numpy is overkill for many  
> lightweight statistics tasks. One shouldn't need to turn on a nuclear  
> reactor just to put the light on in your fridge.

sure -- but you are talking stdlib here -- where do we draw the line? a hard choice every time.

> > "crunching numbers in python without numpy is like doing text processing 
> > without using the string object"
> 
> Your analogy is backwards. String objects actually aren't optimal for  
> heavy duty text processing, because they're immutable. If you're serious 
> about crunching vast amounts of numbers, you'll use numpy. If you're 
> serious about crunch vast amounts of text, say for a text editor or word  
> processor, you *won't* use strings, you'll use some sort of mutable  
> buffer, or ropes, or some other data type. But very unlikely to use  
> strings.

but you sure as heck won't use arbitrary Python sequences of characters, which is what you are doing with this module.

> > What this is really an argument for is a numpy-lite in the standard 
> > library, which could be used to build these sorts of things on. But 
> > that's been rejected before...

> "Numpy-lite". Which parts of numpy? Who maintains it? The numpy release  
> schedule is nothing like the standard library's release schedule, so  
> which one has to change? Or does somebody fork numpy, giving two  
> independent code bases?

yup -- that's why it's been rejected before -- but we did get PEP 3118 as a compromise, so one could build an nd-array-lite that was PEP 3118 compatible, and avoid many of the problems above.

However, as much of a problem as it is to install a third-party compiled package, it's a hell of a lot less work than writing a bunch of new code, so it'll probably never get done.

I myself am trying to write my new stuff to take PEP 3118 buffers, so I can get full high-performing numpy support, but not require users to have numpy -- it is a bit tricky, but can be done. If/when you get to the C-accelerated version, I suggest you consider it.
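(For the curious, a minimal sketch of what I mean, assuming a 1-D buffer of C floats or doubles -- as_float_view and mean are made-up names for illustration, not anything in my code or in the proposed module:

    def as_float_view(data):
        """Wrap a PEP 3118 buffer in a memoryview if we can; else fall back to a list.

        Works for array.array('d', ...), numpy arrays, etc. without importing numpy.
        """
        try:
            view = memoryview(data)       # any PEP 3118 exporter: array.array, numpy, ...
        except TypeError:
            return list(data)             # plain Python sequence: just copy it
        if view.ndim == 1 and view.format in ('f', 'd'):
            return view                   # zero-copy access to the caller's data
        return list(data)                 # unsupported layout: fall back to copying

    def mean(data):
        """Toy mean, just to show the buffer path and the sequence path look the same."""
        values = as_float_view(data)
        return sum(values) / len(values)

The point being that the same pure-Python code path handles lists, array.array, and numpy arrays, and the memoryview branch leaves the door open for C acceleration later.)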

> What about Jython, IronPython, and other Python implementations? Even  
> PyPy doesn't support numpy yet, and Jython and IronPython probably never 
> will, since they're not C-based.

There is a numpy for IronPython, though I don't think it got beyond the alpha stage. Your point is well taken -- but it's also a reason for an ndarray in the stdlib: then maybe other implementations would support it.

> Yeah, right, sure it will be. I've been waiting a decade for package  
> management on Linux to become painless, and it still isn't. There's no  
> reason to expect pip will be more painless than aptitude or yum.

Probably not, true -- but you needed to get Python from somewhere, didn't you? You can't say it's easy to compile that on Windows!

> There is also the social  problem that not everyone is permitted to arbitrarily install software. 

I work for the Federal Government -- believe me, I know.

There's Google App Engine, and things like that too, to support your point....

> complete code audit. (Imagine auditing all of numpy.)

well, the more we add to Python's stdlib, the bigger an issue that will be for all Python users -- another reason to be cautious.

But in the end, I don't think there is a lot you can do with Python without installing some third-party package. How many people do all their code development in IDLE? All their GUIs with Tk? No image processing, writing their own web framework from scratch? The list goes on and on. I may have a few simple text-processing scripts that don't use any third-party packages, but nothing major.

I teach Intro to Python, and while I could probably get away with only the stdlib for the intro class (but sure as heck not the web development class), I don't -- because there is a lot folks should know about to do anything real in Python.

So as much of a pain as it can be to use third-party packages, we can't put everything in the stdlib for that reason.

> There are many, many people in a situation where the Python std lib is  
> approved, usually because it comes from a vendor with a support contract 
> (say, RedHat, Ubuntu, or Suse), but getting third-party packages like  
> numpy approved is next to impossible. 

don't all three of those ship numpy? I haven't used them in ages.

> > to build python extensions (granted, not a given on Windows and Mac, but
> > pretty likely on Linux)
> 
> Oh, well that's okay then -- that's three, maybe four percent of the  
> computing world taken care of! Problem solved!

hence the binaries....

really -- the "I can't install an unapproved package" problem is a show-stopper. "I can't build it" isn't.

> > anyone building their own stuff on Linux is used to that.
> 
> Do you realise that not all Python programmers are used to, or able to, 
> 
> "build their own stuff on Linux"?

then why not "yum install numpy"? or whatever?

> > All that being said -- if you do decide to do this, please use a PEP
> > 3118 (enhanced buffer) supporting data type (probably array.array) --
> > compatibility with numpy and other packages for crunching numbers is
> > very nice.
> 
> py> import array
> py> data = array.array('f', range(1000))
> py> import statistics
> py> statistics.mean(data)
> 499.5

I realized this after posting -- that is a nice feature, and could help a lot -- hurray for the buffer protocol! This makes room for compiled optimization down the road, and then you might be able to use your code with numpy arrays efficiently.
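(To make that concrete, a quick illustration of why the buffer protocol matters here: an array.array and a numpy array describe their memory the same way, so a future C-accelerated version could read either one without copying. This is just me poking at memoryview, not anything from the PEP:

    import array
    import numpy as np   # only needed for this demonstration

    a = array.array('d', range(1000))
    b = np.arange(1000, dtype='d')

    # Both export the same PEP 3118 description of their memory:
    print(memoryview(a).format, memoryview(a).shape)   # d (1000,)
    print(memoryview(b).format, memoryview(b).shape)   # d (1000,)

    # So C code written against the buffer protocol could crunch either
    # one in place -- no copies, and no numpy dependency required.

)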

> If the data type supports the sequence protocol, it should work with my  
> module. If it fails to work, submit a bug report, and I will fix it.

fair enough.
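
(I may well take you up on that -- e.g. with a bare-bones sequence type like the toy class below, which implements only __len__ and __getitem__; I'd expect mean() to handle it. The Readings class is made up purely for the test, of course:

    import statistics   # the proposed PEP 450 module

    class Readings:
        """Minimal read-only sequence: just __len__ and __getitem__."""
        def __init__(self, values):
            self._values = list(values)
        def __len__(self):
            return len(self._values)
        def __getitem__(self, index):
            return self._values[index]

    data = Readings([1.25, 2.5, 3.75, 5.0])
    print(statistics.mean(data))   # expect 3.125, if any sequence really does work

)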

> Like the decimal module, it will probably remain pure-Python for a few  
> releases, but I hope that in the future the statistics module will gain a  
> C-accelerated version. (Or Java-accelerated for Jython, etc.)

a perfectly reasonable development path.

> I expect that PyPy won't need one. But because it's not really aimed at
> number-crunching megabytes of data, speed is not the priority.

I thought one of the key points of PyPy was performance? But anyway, maybe RPython and the JIT will take care of that.


Anyway, this looks like a great project -- not so sure about putting it in the stdlib, and I do hope you'll keep the number crunchers in mind, but great stuff nonetheless.

-Chris




