[Python-ideas] Pre-PEP: adding a statistics module to Python

Mon Aug 5 03:40:27 CEST 2013

On 04/08/13 22:51, Eli Bendersky wrote:
> On Sun, Aug 4, 2013 at 12:07 AM, Ethan Furman <ethan at stoneleaf.us> wrote:
>> On 08/03/2013 07:00 PM, Eli Bendersky wrote:
>>>
>>>
>>> While I'm somewhat -0.5 on the general idea of the statistics module
>>> (competing with well-established, super-optimized and
>>> by-themselves-famous numeric libraries Python has does not sound like
>>> a worthy goal),
>>
>>
>> Sure, competing with already established libraries is silly.  Fortunately,
>> that's not what is happening here.  This PEP is about providing a minimal,
>> common set of statistics functions for the average person.
>
> I'm really not sure who this average person is, but everyone keeps
> talking about him. Is it the same person for whom Dummies books are
> written?
>
> Anyhow, "minimal" is a dangerous slope. With such a module in the
> stdlib, I'm 100% sure we'll get a constant stream of - please add just
> this function (from SciPy) - it's so useful to the "average person" -
> requests. This is unavoidable. And it will be difficult to judge at
> that point why certain functionality belongs or does not belong here.
> So over time we'll end up with a partial Greenspun, by containing an
> ad hoc, slow implementation of half of Numpy/SciPy.

[only half serious]
Perhaps we should have a pure-Python implementation of numpy/scipy, for non-C-based Pythons. If I recall correctly, PyPy had to engage in a massive effort to get numpy even partially working. The pure-Python part of the stdlib is not just the stdlib for CPython, but potentially for the entire Python universe.

> Efforts are better spent in writing a new tutorial on Numpy that shows
> how to do the stuff statistics.py does. Call it "Numpy statistics for
> the average person".

That does not help those who are unable to install numpy due to restrictive policies about what software can be installed.

The choice is not between statistics and better tutorials. We can have both, if somebody volunteers to write those tutorials, or neither. I am not volunteering to write numpy tutorials.

>>>   I have to agree with Alexander w.r.t. "sum". Strongly
>>> -1 from me on having functions with the same name as existing stdlib
>>> functions but different functionality. This is very much unpythonic.
>>
>>
>> I thought the whole point of name spaces was to be able to have the same
>> name mean different things in different contexts.  Surely no one expects to
>> be able to use `webbrowser.open` or `gzip.open` anywhere `open` can be used.
>
> This is not a fair comparison. As a pop quiz, try to imagine the
> difference between 'open' and 'gzip.open' - do you immediately come up
> with the differences in their functionalities? Now, how about 'sum'
> and 'statistics.sum'?

As far as gzip.open goes, I have no idea. Like most people, I expect that there is some difference -- perhaps it only works on gzip files? is the API different in some way? -- but beyond that vague idea that "it is in a different module, therefore it must be different *somehow*" I have no idea how it actually differs from the built-in, or codecs.open. I would have to look them up to find out what the differences actually are.

I expect that any even moderately competent user will think the same way: "statistics.sum is in a different module, presumably it is different somehow, I should look it up to find out how".

If you're going to assume a user who is familiar enough with the gzip module to immediately know the differences between gzip.open and builtins.open, then I should also be permitted to assume a user who is familiar with the statistics module and can likewise immediately come up with the differences:

- built-in sum supports any object which supports + except str and bytes, even if what + does is nothing like addition in the usual sense;
- built-in sum may be fast (written in C) or horribly slow (algorithmically O(n**2) for some types);
- built-in sum may be inaccurate for floats;

while

- statistics.sum supports numbers only;
- statistics.sum should always[1] be O(n) but may be slow (currently no C-accelerator version);
- statistics.sum may be more accurate for floats.
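
To make the differences listed above concrete, here is a rough sketch. The exact_sum helper is hypothetical: it is not the implementation of the proposed statistics.sum, only a stand-in that accepts numbers only, makes a single pass, and accumulates exactly using Fraction before rounding once at the end. math.fsum is shown purely as an accuracy reference.

```python
import math
import numbers
from fractions import Fraction

values = [0.1] * 10

# Built-in sum adds left to right, so rounding error accumulates.
naive = sum(values)            # 0.9999999999999999

# math.fsum tracks partial sums and rounds once at the end.
accurate = math.fsum(values)   # 1.0


def exact_sum(data):
    """Hypothetical stand-in for a numbers-only, single-pass sum.

    Assumes finite int, float, Fraction or Decimal inputs. Each value is
    converted to an exact Fraction, so no rounding occurs until the
    final conversion back to float.
    """
    total = Fraction(0)
    for x in data:
        # Reject str, bytes, lists and anything else that is not a number.
        if not isinstance(x, numbers.Number):
            raise TypeError("exact_sum supports numbers only, got %r" % (x,))
        total += Fraction(x)
    return float(total)


# Cancellation: the built-in loses the 1.0 entirely.
lossy = sum([1e100, 1.0, -1e100])          # 0.0
kept = exact_sum([1e100, 1.0, -1e100])     # 1.0

# Built-in sum accepts anything that supports +, such as lists
# (this works, but copies the accumulated list on every step) ...
flattened = sum([[1], [2], [3]], [])       # [1, 2, 3]

# ... except str and bytes, which it rejects outright.
try:
    sum(["a", "b"], "")
except TypeError:
    rejected_str = True
```

The Fraction approach is chosen here only because it is short and obviously exact; it is slow, which matches the caveat above about there being no C-accelerated version.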

I don't expect people to know this without being told. Frankly, I don't even expect the typical numerically naive user to use statistics.sum when it is so much shorter to type "sum". I can provide a better numeric sum, but I can't force people to use it. But the statistics module uses it extensively; neither the built-in sum nor math.fsum is suitable for my purposes, and I wish to expose that functionality to users who are willing to use it.

And finally, I categorically refuse to call it any variation of statistics.ssum or statistics.statistics_sum. We have namespaces for a reason. If anyone wishes to start bikeshedding names, I will consider reasonable alternatives that don't repeat the name of the namespace.

[1] Excluding weird numeric types that do bizarre things.

-- 
Steven