PEP 450 Adding a statistics module to Python

Josef Pktd josef.pktd at gmail.com
Sat Aug 17 08:13:32 EDT 2013


I think the install issues described in the PEP are exaggerated and, in my opinion, not a sufficient reason to get something into the standard lib.

Google App Engine includes numpy
https://developers.google.com/appengine/docs/python/tools/libraries27

I'm on Windows, and numpy and scipy come as binary installers that install without problems.
There are free binary distributions (for Windows and Ubuntu) that include all the main scientific packages, with a one-click installer on Windows
http://code.google.com/p/pythonxy/wiki/Welcome
http://code.google.com/p/winpython/

How many Linux distributions don't include numpy? (I have no idea.)

For commercial support, Enthought's and Continuum's distributions include all the main packages.

I think having basic descriptive statistics is still useful in a basic Python installation. Similarly, almost all of the descriptive statistics have moved from scipy.stats to numpy.
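To give a sense of scale, the basic descriptive statistics I mean fit in a few lines of pure Python. This is only a minimal sketch; the function names `mean` and `sample_variance` and the sample data are made up for illustration and are not taken from the PEP.

```python
import math


def mean(data):
    """Arithmetic mean of a non-empty sequence of numbers."""
    data = list(data)
    if not data:
        raise ValueError("mean requires at least one data point")
    # math.fsum reduces floating-point rounding error compared to sum()
    return math.fsum(data) / len(data)


def sample_variance(data):
    """Unbiased sample variance (divides by n - 1)."""
    data = list(data)
    n = len(data)
    if n < 2:
        raise ValueError("sample variance requires at least two data points")
    m = mean(data)
    return math.fsum((x - m) ** 2 for x in data) / (n - 1)


# Illustrative data only
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(mean(values))
print(sample_variance(values))
```

Nothing here needs anything outside the standard library, which is why I think this level of functionality is fine for a basic installation.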

However, what is the long-term scope of this supposed to be?

I think working with pure Python is interesting for educational purposes
http://www.greenteapress.com/thinkstats/
but I don't think it will get very far for more extensive uses. Soon you will need some linear algebra (numpy.linalg and scipy.linalg) and special functions (scipy.special).

You can reimplement them, but what's the point of duplicating them in the standard lib?

For example:

t test: which versions? One-sample, two-sample, paired and unpaired, with and without homogeneous variances, each with three alternative hypotheses.

If we have t test, shouldn't we also have ANOVA when we want to compare more than two samples?

...
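To make the "which versions?" question concrete, here is a rough pure-Python sketch of three of the t test variants. The function names (`t_two_sample`, `t_paired`, `_mean_var`) and the data are hypothetical, chosen only for illustration; this is not the API of the PEP or of scipy.stats. It computes only the test statistic and the degrees of freedom.

```python
import math


def _mean_var(xs):
    """Return (n, mean, unbiased sample variance) of a sequence."""
    n = len(xs)
    m = math.fsum(xs) / n
    v = math.fsum((x - m) ** 2 for x in xs) / (n - 1)
    return n, m, v


def t_two_sample(a, b, equal_var=True):
    """Return (t statistic, degrees of freedom) for two independent samples.

    equal_var=True  -> pooled-variance (Student) t test
    equal_var=False -> Welch's t test with Welch-Satterthwaite df
    """
    na, ma, va = _mean_var(a)
    nb, mb, vb = _mean_var(b)
    if equal_var:
        df = na + nb - 2
        pooled = ((na - 1) * va + (nb - 1) * vb) / df
        se = math.sqrt(pooled * (1.0 / na + 1.0 / nb))
    else:
        sa, sb = va / na, vb / nb
        se = math.sqrt(sa + sb)
        df = (sa + sb) ** 2 / (sa ** 2 / (na - 1) + sb ** 2 / (nb - 1))
    return (ma - mb) / se, df


def t_paired(a, b):
    """Paired test: a one-sample t test of the differences against 0."""
    diffs = [x - y for x, y in zip(a, b)]
    n, m, v = _mean_var(diffs)
    return m / math.sqrt(v / n), n - 1


# Illustrative data only
a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [2.0, 4.0, 6.0, 8.0, 10.0]
print(t_two_sample(a, b, equal_var=True))
print(t_two_sample(a, b, equal_var=False))
print(t_paired(a, b))
```

Turning the statistic and degrees of freedom into a p-value requires the CDF of the t distribution, which is usually computed through the regularized incomplete beta function, so this is exactly where special functions come in. The choice between two-sided and the two one-sided alternatives then multiplies the number of variants again.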

If the Python implementations that do not use a C backend need a statistics package and a partial numpy replacement, then I don't think it needs to be in the CPython lib.


I think the "nuclear reactor" analogy is misplaced.

A Python implementation of statistics is a bicycle, numpy is a car, and if you need some heavier lifting in statistics or machine learning, then the trucks are scipy, scikit-learn and statsmodels (and pandas for the data handling).
And rpy for things that are not directly available in Python.


I'm one of the maintainers for scipy.stats and for statsmodels.

We have a similar problem of deciding on the boundaries and scope of numpy, scipy.stats, pandas, patsy, statsmodels and scikit-learn. There is some overlap of functionality where the purpose or use cases are different, but in general we try to avoid too much duplication.


https://pypi.python.org/pypi/statsmodels
https://pypi.python.org/pypi/pandas
https://pypi.python.org/pypi/patsy  (R-like formulas)
https://pypi.python.org/pypi/scikit-learn


Josef


