[Python-checkins] peps: Add PEP 450: Adding A Statistics Module To The Standard Library, by Steven
brett.cannon
python-checkins at python.org
Fri Aug 9 16:46:59 CEST 2013
http://hg.python.org/peps/rev/d8e0108ba02c
changeset: 5041:d8e0108ba02c
user: Brett Cannon <brett at python.org>
date: Fri Aug 09 10:46:53 2013 -0400
summary:
Add PEP 450: Adding A Statistics Module To The Standard Library, by Steven D'Aprano
files:
pep-0450.txt | 420 +++++++++++++++++++++++++++++++++++++++
1 files changed, 420 insertions(+), 0 deletions(-)
diff --git a/pep-0450.txt b/pep-0450.txt
new file mode 100644
--- /dev/null
+++ b/pep-0450.txt
@@ -0,0 +1,420 @@
+PEP: 450
+Title: Adding A Statistics Module To The Standard Library
+Version: $Revision$
+Last-Modified: $Date$
+Author: Steven D'Aprano <steve at pearwood.info>
+Status: Draft
+Type: Standards Track
+Content-Type: text/plain
+Created: 01-Aug-2013
+Python-Version: 3.4
+Post-History:
+
+
+Abstract
+
+ This PEP proposes the addition of a module for common statistics functions
+ such as mean, median, variance and standard deviation to the Python
+ standard library.
+
+
+Rationale
+
+ The proposed statistics module is motivated by the "batteries included"
+ philosophy towards the Python standard library. Raymond Hettinger and
+ other senior developers have requested a quality statistics library that
+ falls somewhere in between high-end statistics libraries and ad hoc
+ code.[1] Statistical functions such as mean, standard deviation and others
+ are obvious and useful batteries, familiar to any Secondary School student.
+ Even cheap scientific calculators typically include multiple statistical
+ functions such as:
+
+ - mean
+ - population and sample variance
+ - population and sample standard deviation
+ - linear regression
+ - correlation coefficient
+
+ Graphing calculators aimed at Secondary School students typically
+ include all of the above, plus some or all of:
+
+ - median
+ - mode
+ - functions for calculating the probability of random variables
+ from the normal, t, chi-squared, and F distributions
+ - inference on the mean
+
+ and others[2]. Likewise spreadsheet applications such as Microsoft Excel,
+ LibreOffice and Gnumeric include rich collections of statistical
+ functions[3].
+
+ In contrast, Python currently has no standard way to calculate even the
+ simplest and most obvious statistical functions such as mean. For those
+ who need statistical functions in Python, there are two obvious solutions:
+
+ - install numpy and/or scipy[4];
+
+ - or use a Do It Yourself solution.
+
+ Numpy is perhaps the most full-featured solution, but it has a few
+ disadvantages:
+
+ - It may be overkill for many purposes. The documentation for numpy even
+ warns
+
+ "It can be hard to know what functions are available in
+ numpy. This is not a complete list, but it does cover
+ most of them."[5]
+
+ and then goes on to list over 270 functions, only a small number of
+ which are related to statistics.
+
+ - Numpy is aimed at those doing heavy numerical work, and may be
+ intimidating to those who don't have a background in computational
+ mathematics and computer science. For example, numpy.mean takes four
+ arguments:
+
+ mean(a, axis=None, dtype=None, out=None)
+
+ although fortunately for the beginner or casual numpy user, three are
+ optional and numpy.mean does the right thing in simple cases:
+
+ >>> numpy.mean([1, 2, 3, 4])
+ 2.5
+
+ - For many people, installing numpy may be difficult or impossible. For
+ example, people in corporate environments may have to go through a
+ difficult, time-consuming process before being permitted to install
+ third-party software. For the casual Python user, having to learn about
+ installing third-party packages in order to average a list of numbers is
+ unfortunate.
+
+ This leads to option number 2, DIY statistics functions. At first glance,
+ this appears to be an attractive option, due to the apparent simplicity of
+ common statistical functions. For example:
+
+ def mean(data):
+ return sum(data)/len(data)
+
+ def variance(data):
+ # Use the Computational Formula for Variance.
+ n = len(data)
+ ss = sum(x**2 for x in data) - (sum(data)**2)/n
+ return ss/(n-1)
+
+ def standard_deviation(data):
+ return math.sqrt(variance(data))
+
+ The above appears to be correct with a casual test:
+
+ >>> data = [1, 2, 4, 5, 8]
+ >>> variance(data)
+ 7.5
+
+ But adding a constant to every data point should not change the variance:
+
+ >>> data = [x+1e12 for x in data]
+ >>> variance(data)
+ 0.0
+
+ And variance should *never* be negative:
+
+ >>> variance(data*100)
+ -1239429440.1282566
+
+ By contrast, the proposed reference implementation gets the exactly correct
+ answer 7.5 for the first two examples, and a reasonably close answer for
+ the third: 6.012. numpy does no better[6].
+
+ Even simple statistical calculations contain traps for the unwary, starting
+ with the Computational Formula itself. Despite the name, it is numerically
+ unstable and can be extremely inaccurate, as can be seen above. It is
+ completely unsuitable for computation by computer[7]. This problem plagues
+ users of many programming language, not just Python[8], as coders reinvent
+ the same numerically inaccurate code over and over again[9], or advise
+ others to do so[10].
+
+ It isn't just the variance and standard deviation. Even the mean is not
+ quite as straight-forward as it might appear. The above implementation
+ seems too simple to have problems, but it does:
+
+ - The built-in sum can lose accuracy when dealing with floats of wildly
+ differing magnitude. Consequently, the above naive mean fails this
+ "torture test":
+
+ assert mean([1e30, 1, 3, -1e30]) == 1
+
+ returning 0 instead of 1, a purely computational error of 100%.
+
+ - Using math.fsum inside mean will make it more accurate with float data,
+ but it also has the side-effect of converting any arguments to float
+ even when unnecessary. E.g. we should expect the mean of a list of
+ Fractions to be a Fraction, not a float.
+
+ While the above mean implementation does not fail quite as catastrophically
+ as the naive variance does, a standard library function can do much better
+ than the DIY versions.
+
+ The example above involves an especially bad set of data, but even for
+ more realistic data sets accuracy is important. The first step in
+ interpreting variation in data (including dealing with ill-conditioned
+ data) is often to standardize it to a series with variance 1 (and often
+ mean 0). This standardization requires accurate computation of the mean
+ and variance of the raw series. Naive computation of mean and variance
+ can lose precision very quickly. Because precision bounds accuracy, it is
+ important to use the most precise algorithms for computing mean and
+ variance that are practical, or the results of standardization are
+ themselves useless.
+
+
+Comparison To Other Languages/Packages
+
+ The proposed statistics library is not intended to be a competitor to such
+ third-party libraries as numpy/scipy, or of proprietary full-featured
+ statistics packages aimed at professional statisticians such as Minitab,
+ SAS and Matlab. It is aimed at the level of graphing and scientific
+ calculators.
+
+ Most programming languages have little or no built-in support for
+ statistics functions. Some exceptions:
+
+ R
+ R (and its proprietary cousin, S) is a programming language designed
+ for statistics work. It is extremely popular with statisticians and
+ is extremely feature-rich[11].
+
+ C#
+
+ The C# LINQ package includes extension methods to calculate the
+ average of enumerables[12].
+
+ Ruby
+
+ Ruby does not ship with a standard statistics module, despite some
+ apparent demand[13]. Statsample appears to be a feature-rich third-
+ party library, aiming to compete with R[14].
+
+ PHP
+
+ PHP has an extremely feature-rich (although mostly undocumented) set
+ of advanced statistical functions[15].
+
+ Delphi
+
+ Delphi includes standard statistical functions including Mean, Sum,
+ Variance, TotalVariance, MomentSkewKurtosis in its Math library[16].
+
+ GNU Scientific Library
+
+ The GNU Scientific Library includes standard statistical functions,
+ percentiles, median and others[17]. One innovation I have borrowed
+ from the GSL is to allow the caller to optionally specify the pre-
+ calculated mean of the sample (or an a priori known population mean)
+ when calculating the variance and standard deviation[18].
+
+
+Design Decisions Of The Module
+
+ My intention is to start small and grow the library as needed, rather than
+ try to include everything from the start. Consequently, the current
+ reference implementation includes only a small number of functions: mean,
+ variance, standard deviation, median, mode. (See the reference
+ implementation for a full list.)
+
+ I have aimed for the following design features:
+
+ - Correctness over speed. It is easier to speed up a correct but slow
+ function than to correct a fast but buggy one.
+
+ - Concentrate on data in sequences, allowing two-passes over the data,
+ rather than potentially compromise on accuracy for the sake of a one-pass
+ algorithm. Functions expect data will be passed as a list or other
+ sequence; if given an iterator, they may internally convert to a list.
+
+ - Functions should, as much as possible, honour any type of numeric data.
+ E.g. the mean of a list of Decimals should be a Decimal, not a float.
+ When this is not possible, treat float as the "lowest common data type".
+
+ - Although functions support data sets of floats, Decimals or Fractions,
+ there is no guarantee that *mixed* data sets will be supported. (But on
+ the other hand, they aren't explicitly rejected either.)
+
+ - Plenty of documentation, aimed at readers who understand the basic
+ concepts but may not know (for example) which variance they should use
+ (population or sample?). Mathematicians and statisticians have a terrible
+ habit of being inconsistent with both notation and terminology[19], and
+ having spent many hours making sense of the contradictory/confusing
+ definitions in use, it is only fair that I do my best to clarify rather
+ than obfuscate the topic.
+
+ - But avoid going into tedious[20] mathematical detail.
+
+
+Specification
+
+ As the proposed reference implementation is in pure Python,
+ other Python implementations can easily make use of the module
+ unchanged, or adapt it as they see fit.
+
+
+What Should Be The Name Of The Module?
+
+ This will be a top-level module "statistics".
+
+ There was some interest in turning math into a package, and making this a
+ sub-module of math, but the general consensus eventually agreed on a
+ top-level module. Other potential but rejected names included "stats" (too
+ much risk of confusion with existing "stat" module), and "statslib"
+ (described as "too C-like").
+
+
+Previous Discussions
+
+ This proposal has been previously discussed here[21].
+
+
+Frequently Asked Questions
+
+ Q: Shouldn't this module spend time on PyPI before being considered for
+ the standard library?
+
+ A: Older versions of this module have been available on PyPI[22] since
+ 2010. Being much simpler than numpy, it does not require many years of
+ external development.
+
+ Q: Does the standard library really need yet another version of ``sum``?
+
+ A: This proved to be the most controversial part of the reference
+ implementation. In one sense, clearly three sums is two too many. But
+ in another sense, yes. The reasons why the two existing versions are
+ unsuitable are described here[23] but the short summary is:
+
+ - the built-in sum can lose precision with floats;
+
+ - the built-in sum accepts any non-numeric data type that supports
+ the + operator, apart from strings and bytes;
+
+ - math.fsum is high-precision, but coerces all arguments to float.
+
+ There is some interest in "fixing" one or the other of the existing
+ sums. If this occurs before 3.4 feature-freeze, the decision to keep
+ statistics.sum can be re-considered.
+
+ Q: Will this module be backported to older versions of Python?
+
+ A: The module currently targets 3.3, and I will make it available on PyPI
+ for 3.3 for the foreseeable future. Backporting to older versions of
+ the 3.x series is likely (but not yet decided). Backporting to 2.7 is
+ less likely but not ruled out.
+
+ Q: Is this supposed to replace numpy?
+
+ A: No. While it is likely to grow over the years (see open issues below)
+ it is not aimed to replace, or even compete directly with, numpy. Numpy
+ is a full-featured numeric library aimed at professionals, the nuclear
+ reactor of numeric libraries in the Python ecosystem. This is just a
+ battery, as in "batteries included", and is aimed at an intermediate
+ level somewhere between "use numpy" and "roll your own version".
+
+
+Open and Deferred Issues
+
+ - At this stage, I am unsure of the best API for multivariate statistical
+ functions such as linear regression, correlation coefficient, and
+ covariance. Possible APIs include:
+
+ * Separate arguments for x and y data:
+ function([x0, x1, ...], [y0, y1, ...])
+
+ * A single argument for (x, y) data:
+ function([(x0, y0), (x1, y1), ...])
+
+ * Selecting arbitrary columns from a 2D array:
+ function([[a0, x0, y0, z0], [a1, x1, y1, z1], ...], x=1, y=2)
+
+ * Some combination of the above.
+
+ In the absence of a consensus of preferred API for multivariate stats,
+ I will defer including such multivariate functions until Python 3.5.
+
+ - Likewise, functions for calculating probability of random variables and
+ inference testing (e.g. Student's t-test) will be deferred until 3.5.
+
+ - There is considerable interest in including one-pass functions that can
+ calculate multiple statistics from data in iterator form, without having
+ to convert to a list. The experimental "stats" package on PyPI includes
+ co-routine versions of statistics functions. Including these will be
+ deferred to 3.5.
+
+
+References
+
+ [1] http://mail.python.org/pipermail/python-dev/2010-October/104721.html
+
+ [2] http://support.casio.com/pdf/004/CP330PLUSver310_Soft_E.pdf
+
+ [3] Gnumeric:
+ https://projects.gnome.org/gnumeric/functions.shtml
+
+ LibreOffice:
+ https://help.libreoffice.org/Calc/Statistical_Functions_Part_One
+ https://help.libreoffice.org/Calc/Statistical_Functions_Part_Two
+ https://help.libreoffice.org/Calc/Statistical_Functions_Part_Three
+ https://help.libreoffice.org/Calc/Statistical_Functions_Part_Four
+ https://help.libreoffice.org/Calc/Statistical_Functions_Part_Five
+
+ [4] Scipy: http://scipy-central.org/
+ Numpy: http://www.numpy.org/
+
+ [5] http://wiki.scipy.org/Numpy_Functions_by_Category
+
+ [6] Tested with numpy 1.6.1 and Python 2.7.
+
+ [7] http://www.johndcook.com/blog/2008/09/26/comparing-three-methods-of-computing-standard-deviation/
+
+ [8] http://rosettacode.org/wiki/Standard_deviation
+
+ [9] https://bitbucket.org/larsyencken/simplestats/src/c42e048a6625/src/basic.py
+
+ [10] http://stackoverflow.com/questions/2341340/calculate-mean-and-variance-with-one-iteration
+
+ [11] http://www.r-project.org/
+
+ [12] http://msdn.microsoft.com/en-us/library/system.linq.enumerable.average.aspx
+
+ [13] https://www.bcg.wisc.edu/webteam/support/ruby/standard_deviation
+
+ [14] http://ruby-statsample.rubyforge.org/
+
+ [15] http://www.php.net/manual/en/ref.stats.php
+
+ [16] http://www.ayton.id.au/gary/it/Delphi/D_maths.htm#Delphi%20Statistical%20functions.
+
+ [17] http://www.gnu.org/software/gsl/manual/html_node/Statistics.html
+
+ [18] http://www.gnu.org/software/gsl/manual/html_node/Mean-and-standard-deviation-and-variance.html
+
+ [19] http://mathworld.wolfram.com/Skewness.html
+
+ [20] At least, tedious to those who don't like this sort of thing.
+
+ [21] http://mail.python.org/pipermail/python-ideas/2011-September/011524.html
+
+ [22] https://pypi.python.org/pypi/stats/
+
+ [23] http://mail.python.org/pipermail/python-ideas/2013-August/022630.html
+
+
+Copyright
+
+ This document has been placed in the public domain.
+
+
+
+Local Variables:
+mode: indented-text
+indent-tabs-mode: nil
+sentence-end-double-space: t
+fill-column: 70
+coding: utf-8
+End:
--
Repository URL: http://hg.python.org/peps
More information about the Python-checkins
mailing list