PEP 450 Adding a statistics module to Python

Fri Aug 16 14:51:49 EDT 2013

On Fri, 16 Aug 2013 09:31:34 -0700, chris.barker wrote:

>> > I am seeking comments on PEP 450, Adding a statistics module to
>> > Python's
> 
> The trick here is that numpy really is the "right" way to do this stuff.

Numpy does not have a monopoly on the correct algorithms for statistics 
functions, and a big, heavyweight library like numpy is overkill for many 
lightweight statistics tasks. You shouldn't need to turn on a nuclear 
reactor just to put the light on in your fridge.

> I like to say:
> "crunching numbers in python without numpy is like doing text processing
> without using the string object"

Your analogy is backwards. String objects actually aren't optimal for 
heavy duty text processing, because they're immutable. If you're serious 
about crunching vast amounts of numbers, you'll use numpy. If you're 
serious about crunching vast amounts of text, say for a text editor or 
word processor, you *won't* use strings, you'll use some sort of mutable 
buffer, or ropes, or some other data type. You are very unlikely to use 
strings.
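
To make the immutability point concrete, here is a minimal sketch using 
only built-in types. The sample text and variable names are purely 
illustrative, and a real editor would more likely use a gap buffer or a 
rope than a plain bytearray:

```python
# A str is immutable: item assignment is refused outright, so every
# "edit" has to build a brand new string object.
text = "hello world"
try:
    text[0] = "J"
except TypeError as exc:
    error = str(exc)

# A bytearray is a mutable buffer: the same object is edited in place.
buf = bytearray(b"hello world")
buf[0:1] = b"J"   # overwrite the first byte
buf[5:5] = b","   # insert a comma before the space
result = buf.decode("ascii")
print(result)
```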

> What this is really an argument for is a numpy-lite in the standard
> library, which could be used to build these sorts of things on. But
> that's been rejected before...

"Numpy-lite". Which parts of numpy? Who maintains it? The numpy release 
schedule is nothing like the standard library's release schedule, so 
which one has to change? Or does somebody fork numpy, giving two 
independent code bases?

What about Jython, IronPython, and other Python implementations? Even 
PyPy doesn't support numpy yet, and Jython and IronPython probably never 
will, since they're not C-based.

> A few other comments:
> 
> 1) the numpy folks have been VERY good at providing binaries for Windows
> and OS-X -- easy point and click installing.
> 
> 2) I hope we're almost there with standardizing pip and binary wheels,
> at which point pip install will be painless.

Yeah, right, sure it will be. I've been waiting a decade for package 
management on Linux to become painless, and it still isn't. There's no 
reason to expect pip will be more painless than aptitude or yum.

But even if it is, installation of software is not just a software 
problem to be solved by better technology. There is also the social 
problem that not everyone is permitted to arbitrarily install software. 
I'm not just talking about security policies on the machine, but security 
policies in real life. People can be sacked for installing software they 
don't have permission to install.

Machines may be locked down, users may have to submit a request before 
software will be installed. That may involve a security audit, legal 
review of licensing, strategy for full roll-back, potentially even a 
complete code audit. (Imagine auditing all of numpy.) Or policy may 
simply say, *no software from unapproved vendors* full stop.

Not everyone is privileged to be permitted to install whatever software 
they like, when they like. Here are two organisations that make software 
installation requests *easy*:

http://www.uhd.edu/computing/acl/SoftwareInstallationRequest.html

http://www.calstatela.edu/its/services/software/instructsoftwarerequest.php/form2.php

Pip install isn't going to fix that.

There are many, many people in a situation where the Python std lib is 
approved, usually because it comes from a vendor with a support contract 
(say, RedHat, Ubuntu, or Suse), but getting third-party packages like 
numpy approved is next to impossible. "Just install numpy" is a solution 
for a privileged few.

> even before (2) -- pip install works fine anywhere the system is set up
> to build python extensions (granted, not a given on Windows and Mac, but
> pretty likely on Linux)

Oh, well that's okay then -- that's three, maybe four percent of the 
computing world taken care of! Problem solved!

Not.

> -- the idea that running pip install wrote out a
> lot of text (but worked!) is somehow a barrier to entry is absurd --
> anyone building their own stuff on Linux is used to that.

Do you realise that not all Python programmers are used to, or able to, 
"build their own stuff on Linux"?

[...]
> All that being said -- if you do decide to do this, please use a PEP
> 3118 (enhanced buffer) supporting data type (probably array.array) --
> compatibility with numpy and other packages for crunching numbers is
> very nice.

py> import array
py> data = array.array('f', range(1000))
py> import statistics
py> statistics.mean(data)
499.5
py> statistics.stdev(data)
288.8194360957494

If the data type supports the sequence protocol, it should work with my 
module. If it fails to work, submit a bug report, and I will fix it.
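
As a rough illustration of what "supports the sequence protocol" means 
here, the sketch below feeds the same values to mean() in four different 
containers. It assumes the module is importable as `statistics`, as in 
the session above. The `Readings` class and the sample values are 
hypothetical, made up for this example:

```python
import array
import statistics
from collections.abc import Sequence


class Readings(Sequence):
    """Hypothetical minimal sequence: just __getitem__ and __len__."""

    def __init__(self, values):
        self._values = list(values)

    def __getitem__(self, index):
        return self._values[index]

    def __len__(self):
        return len(self._values)


values = [1.5, 2.5, 2.5, 2.75, 3.25, 4.75]

# The same data in four containers that all behave as sequences.
containers = {
    "list": list(values),
    "tuple": tuple(values),
    "array": array.array("d", values),
    "custom": Readings(values),
}

means = {name: statistics.mean(data) for name, data in containers.items()}
for name, value in means.items():
    print(name, value)
```

Each container should give the same mean, since the function only needs 
to iterate over the data and count it.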

> If someone decides to build a stand-alone stats package -- building it
> on a ndarray-lite (PEP 3118 compatible) object would be a nice way to
> go.
> 
> 
> One other point -- for performance reasons, it would be nice to have some
> compiled code in there -- this adds incentive to put it in the stdlib --
> external packages that need compiling is what makes numpy unacceptable
> to some folks.

Like the decimal module, it will probably remain pure-Python for a few 
releases, but I hope that in the future the statistics module will gain a 
C-accelerated version. (Or Java-accelerated for Jython, etc.) I expect 
that PyPy won't need one. But because it's not really aimed at number-
crunching megabytes of data, speed is not the priority.

-- 
Steven