[Python-checkins] peps: Add PEP 450: Adding A Statistics Module To The Standard Library, by Steven

brett.cannon python-checkins at python.org
Fri Aug 9 16:46:59 CEST 2013


http://hg.python.org/peps/rev/d8e0108ba02c
changeset:   5041:d8e0108ba02c
user:        Brett Cannon <brett at python.org>
date:        Fri Aug 09 10:46:53 2013 -0400
summary:
  Add PEP 450: Adding A Statistics Module To The Standard Library, by Steven D'Aprano

files:
  pep-0450.txt |  420 +++++++++++++++++++++++++++++++++++++++
  1 files changed, 420 insertions(+), 0 deletions(-)


diff --git a/pep-0450.txt b/pep-0450.txt
new file mode 100644
--- /dev/null
+++ b/pep-0450.txt
@@ -0,0 +1,420 @@
+PEP: 450
+Title: Adding A Statistics Module To The Standard Library
+Version: $Revision$
+Last-Modified: $Date$
+Author: Steven D'Aprano <steve at pearwood.info>
+Status: Draft
+Type: Standards Track
+Content-Type: text/plain
+Created: 01-Aug-2013
+Python-Version: 3.4
+Post-History:
+
+
+Abstract
+
+    This PEP proposes the addition of a module for common statistics functions
+    such as mean, median, variance and standard deviation to the Python
+    standard library.
+
+
+Rationale
+
+    The proposed statistics module is motivated by the "batteries included"
+    philosophy towards the Python standard library.  Raymond Hettinger and
+    other senior developers have requested a quality statistics library that
+    falls somewhere in between high-end statistics libraries and ad hoc
+    code[1].  Statistical functions such as mean, standard deviation and others
+    are obvious and useful batteries, familiar to any Secondary School student.
+    Even cheap scientific calculators typically include multiple statistical
+    functions such as:
+
+    - mean
+    - population and sample variance
+    - population and sample standard deviation
+    - linear regression
+    - correlation coefficient
+
+    Graphing calculators aimed at Secondary School students typically
+    include all of the above, plus some or all of:
+
+    - median
+    - mode
+    - functions for calculating the probability of random variables
+      from the normal, t, chi-squared, and F distributions
+    - inference on the mean
+
+    and others[2].  Likewise spreadsheet applications such as Microsoft Excel,
+    LibreOffice and Gnumeric include rich collections of statistical
+    functions[3].
+
+    In contrast, Python currently has no standard way to calculate even the
+    simplest and most obvious statistical functions such as mean.  For those
+    who need statistical functions in Python, there are two obvious solutions:
+
+    - install numpy and/or scipy[4];
+
+    - or use a Do It Yourself solution.
+
+    Numpy is perhaps the most full-featured solution, but it has a few
+    disadvantages:
+
+    - It may be overkill for many purposes.  The documentation for numpy even
+      warns
+
+          "It can be hard to know what functions are available in
+          numpy.  This is not a complete list, but it does cover
+          most of them."[5]
+
+      and then goes on to list over 270 functions, only a small number of
+      which are related to statistics.
+
+    - Numpy is aimed at those doing heavy numerical work, and may be
+      intimidating to those who don't have a background in computational
+      mathematics and computer science.  For example, numpy.mean takes four
+      arguments:
+
+        mean(a, axis=None, dtype=None, out=None)
+
+      although fortunately for the beginner or casual numpy user, three are
+      optional and numpy.mean does the right thing in simple cases:
+
+          >>> numpy.mean([1, 2, 3, 4])
+          2.5
+
+    - For many people, installing numpy may be difficult or impossible.  For
+      example, people in corporate environments may have to go through a
+      difficult, time-consuming process before being permitted to install
+      third-party software.  For the casual Python user, having to learn about
+      installing third-party packages in order to average a list of numbers is
+      unfortunate.
+
+    This leads to option number 2, DIY statistics functions.  At first glance,
+    this appears to be an attractive option, due to the apparent simplicity of
+    common statistical functions.  For example:
+
+        import math
+
+        def mean(data):
+            return sum(data)/len(data)
+
+        def variance(data):
+            # Use the Computational Formula for Variance.
+            n = len(data)
+            ss = sum(x**2 for x in data) - (sum(data)**2)/n
+            return ss/(n-1)
+
+        def standard_deviation(data):
+            return math.sqrt(variance(data))
+
+    The above appears to be correct with a casual test:
+
+        >>> data = [1, 2, 4, 5, 8]
+        >>> variance(data)
+        7.5
+
+    But adding a constant to every data point should not change the variance:
+
+        >>> data = [x+1e12 for x in data]
+        >>> variance(data)
+        0.0
+
+    And variance should *never* be negative:
+
+        >>> variance(data*100)
+        -1239429440.1282566
+
+    By contrast, the proposed reference implementation gets the exactly correct
+    answer 7.5 for the first two examples, and a reasonably close answer for
+    the third: 6.012. numpy does no better[6].
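+
+    As an illustration only, the following minimal sketch (not the
+    reference implementation) contrasts the Computational Formula with a
+    two-pass calculation that first finds the mean and then sums the
+    squared deviations from it.  The names naive_variance and
+    two_pass_variance are hypothetical, and math.fsum is assumed to be
+    acceptable here because the data are floats:
+
```python
import math


def naive_variance(data):
    # Computational Formula: subtracts two huge, nearly equal numbers.
    n = len(data)
    ss = sum(x**2 for x in data) - (sum(data)**2)/n
    return ss/(n - 1)


def two_pass_variance(data):
    # Pass 1: the mean.  Pass 2: squared deviations from that mean.
    n = len(data)
    xbar = math.fsum(data)/n
    ss = math.fsum((x - xbar)**2 for x in data)
    return ss/(n - 1)


data = [1, 2, 4, 5, 8]
shifted = [x + 1e12 for x in data]

print(two_pass_variance(data))         # 7.5
print(two_pass_variance(shifted))      # 7.5
print(naive_variance(shifted))         # not 7.5
print(two_pass_variance(shifted*100))  # close to 3000/499, about 6.012
```
+
+    The deviations from the mean are small numbers, so squaring and adding
+    them does not cancel away the significant digits the way the
+    difference of two sums near 5e24 does.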
+
+    Even simple statistical calculations contain traps for the unwary, starting
+    with the Computational Formula itself.  Despite the name, it is numerically
+    unstable and can be extremely inaccurate, as can be seen above.  It is
+    completely unsuitable for computation by computer[7].  This problem plagues
+    users of many programming languages, not just Python[8], as coders reinvent
+    the same numerically inaccurate code over and over again[9], or advise
+    others to do so[10].
+
+    It isn't just the variance and standard deviation. Even the mean is not
+    quite as straightforward as it might appear.  The above implementation
+    seems too simple to have problems, but it does:
+
+    - The built-in sum can lose accuracy when dealing with floats of wildly
+      differing magnitude.  Consequently, the above naive mean fails this
+      "torture test":
+
+          assert mean([1e30, 1, 3, -1e30]) == 1
+
+      returning 0 instead of 1, a purely computational error of 100%.
+
+    - Using math.fsum inside mean will make it more accurate with float data,
+      but it also has the side-effect of converting any arguments to float
+      even when unnecessary.  E.g. we should expect the mean of a list of
+      Fractions to be a Fraction, not a float.
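+
+    Both points in the list above can be reproduced with the standard
+    library alone.  The snippet is a small demonstration rather than part
+    of the proposal; the variable names floats and fracs are made up for
+    the example:
+
```python
import math
from fractions import Fraction

floats = [1e30, 1, 3, -1e30]

# The built-in sum absorbs 1 and 3 into 1e30 and loses them.
print(sum(floats)/len(floats))        # 0.0

# math.fsum tracks the lost low-order bits and recovers the exact total.
print(math.fsum(floats)/len(floats))  # 1.0

fracs = [Fraction(1, 3), Fraction(2, 3)]

# The built-in sum keeps the data type: the result is Fraction(1, 2).
print(sum(fracs)/len(fracs))

# math.fsum coerces to float, so the exact rational result is lost.
print(type(math.fsum(fracs)/len(fracs)).__name__)  # float
```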
+
+    While the above mean implementation does not fail quite as catastrophically
+    as the naive variance does, a standard library function can do much better
+    than the DIY versions.
+
+    The example above involves an especially bad set of data, but even for
+    more realistic data sets accuracy is important.  The first step in
+    interpreting variation in data (including dealing with ill-conditioned
+    data) is often to standardize it to a series with variance 1 (and often
+    mean 0).  This standardization requires accurate computation of the mean
+    and variance of the raw series.  Naive computation of mean and variance
+    can lose precision very quickly.  Because precision bounds accuracy, it is
+    important to use the most precise algorithms for computing mean and
+    variance that are practical, or the results of standardization are
+    themselves useless.
+
+
+Comparison To Other Languages/Packages
+
+    The proposed statistics library is not intended to be a competitor to such
+    third-party libraries as numpy/scipy, or to proprietary full-featured
+    statistics packages aimed at professional statisticians such as Minitab,
+    SAS and Matlab.  It is aimed at the level of graphing and scientific
+    calculators.
+
+    Most programming languages have little or no built-in support for
+    statistics functions.  Some exceptions:
+
+    R
+
+        R (and its proprietary cousin, S) is a programming language designed
+        for statistics work. It is extremely popular with statisticians and
+        is extremely feature-rich[11].
+
+    C#
+
+        The C# LINQ package includes extension methods to calculate the
+        average of enumerables[12].
+
+    Ruby
+
+        Ruby does not ship with a standard statistics module, despite some
+        apparent demand[13].  Statsample appears to be a feature-rich third-
+        party library, aiming to compete with R[14].
+
+    PHP
+
+        PHP has an extremely feature-rich (although mostly undocumented) set
+        of advanced statistical functions[15].
+
+    Delphi
+
+        Delphi provides standard statistical functions, including Mean,
+        Sum, Variance, TotalVariance and MomentSkewKurtosis, in its Math
+        library[16].
+
+    GNU Scientific Library
+
+        The GNU Scientific Library includes standard statistical functions,
+        percentiles, median and others[17].  One innovation I have borrowed
+        from the GSL is to allow the caller to optionally specify the pre-
+        calculated mean of the sample (or an a priori known population mean)
+        when calculating the variance and standard deviation[18].
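+
+    The idea borrowed from the GSL can be sketched as follows.  This is an
+    assumption about what such a signature might look like, not the
+    reference implementation's API: the function name sample_variance and
+    the keyword xbar are hypothetical, and the sketch covers only the case
+    of a precomputed sample mean (it keeps the n-1 divisor throughout):
+
```python
import math


def sample_variance(data, xbar=None):
    # xbar is an optional, already-known mean; compute it only if absent.
    n = len(data)
    if n < 2:
        raise ValueError("sample variance needs at least two data points")
    if xbar is None:
        xbar = math.fsum(data)/n
    return math.fsum((x - xbar)**2 for x in data)/(n - 1)


data = [1.0, 2.0, 4.0, 5.0, 8.0]
xbar = math.fsum(data)/len(data)         # 4.0, computed once

print(sample_variance(data))             # 7.5
print(sample_variance(data, xbar=xbar))  # 7.5, reusing the mean
```
+
+    A caller who has already computed the mean avoids a second pass over
+    the data to recompute it.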
+
+
+Design Decisions Of The Module
+
+    My intention is to start small and grow the library as needed, rather than
+    try to include everything from the start. Consequently, the current
+    reference implementation includes only a small number of functions: mean,
+    variance, standard deviation, median, mode. (See the reference
+    implementation for a full list.)
+
+    I have aimed for the following design features:
+
+    - Correctness over speed.  It is easier to speed up a correct but slow
+      function than to correct a fast but buggy one.
+
+    - Concentrate on data in sequences, allowing two passes over the data,
+      rather than potentially compromise on accuracy for the sake of a one-pass
+      algorithm.  Functions expect data will be passed as a list or other
+      sequence; if given an iterator, they may internally convert to a list.
+
+    - Functions should, as much as possible, honour any type of numeric data.
+      E.g. the mean of a list of Decimals should be a Decimal, not a float.
+      When this is not possible, treat float as the "lowest common data type".
+
+    - Although functions support data sets of floats, Decimals or Fractions,
+      there is no guarantee that *mixed* data sets will be supported. (But on
+      the other hand, they aren't explicitly rejected either.)
+
+    - Plenty of documentation, aimed at readers who understand the basic
+      concepts but may not know (for example) which variance they should use
+      (population or sample?). Mathematicians and statisticians have a terrible
+      habit of being inconsistent with both notation and terminology[19], and
+      having spent many hours making sense of the contradictory/confusing
+      definitions in use, it is only fair that I do my best to clarify rather
+      than obfuscate the topic.
+
+    - But avoid going into tedious[20] mathematical detail.
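+
+    Two of the design features above, converting iterators to lists and
+    honouring the numeric type of the data, can be sketched in a few
+    lines.  The function simple_mean is hypothetical and deliberately
+    naive: it relies on the built-in sum, so it illustrates type
+    preservation only and inherits the precision loss with floats that
+    the Rationale describes:
+
```python
from decimal import Decimal
from fractions import Fraction


def simple_mean(data):
    # Iterators are converted to a list so that len() works.
    if iter(data) is data:
        data = list(data)
    if not data:
        raise ValueError("mean requires at least one data point")
    # The built-in sum starts from int 0, which Decimal and Fraction both
    # accept, so the result keeps the type of the data.
    return sum(data)/len(data)


print(simple_mean([Decimal("0.1"), Decimal("0.2")]))  # 0.15
print(simple_mean([Fraction(1, 3), Fraction(2, 3)]))  # 1/2
print(simple_mean(iter([1.5, 2.5])))                  # 2.0
```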
+
+
+Specification
+
+    As the proposed reference implementation is in pure Python,
+    other Python implementations can easily make use of the module
+    unchanged, or adapt it as they see fit.
+
+
+What Should Be The Name Of The Module?
+
+    This will be a top-level module "statistics".
+
+    There was some interest in turning math into a package, and making this a
+    sub-module of math, but the consensus eventually settled on a
+    top-level module.  Other potential but rejected names included "stats" (too
+    much risk of confusion with existing "stat" module), and "statslib"
+    (described as "too C-like").
+
+
+Previous Discussions
+
+    This proposal has been previously discussed here[21].
+
+
+Frequently Asked Questions
+
+    Q: Shouldn't this module spend time on PyPI before being considered for
+       the standard library?
+
+    A: Older versions of this module have been available on PyPI[22] since
+       2010. Being much simpler than numpy, it does not require many years of
+       external development.
+
+    Q: Does the standard library really need yet another version of ``sum``?
+
+    A: This proved to be the most controversial part of the reference
+       implementation.  In one sense, clearly three sums is two too many.  But
+       in another sense, yes.  The reasons why the two existing versions are
+       unsuitable are described here[23] but the short summary is:
+
+       - the built-in sum can lose precision with floats;
+
+       - the built-in sum accepts any non-numeric data type that supports
+         the + operator, apart from strings and bytes;
+
+       - math.fsum is high-precision, but coerces all arguments to float.
+
+       There is some interest in "fixing" one or the other of the existing
+       sums. If this occurs before 3.4 feature-freeze, the decision to keep
+       statistics.sum can be re-considered.
+
+    Q: Will this module be backported to older versions of Python?
+
+    A: The module currently targets 3.3, and I will make it available on PyPI
+       for 3.3 for the foreseeable future. Backporting to older versions of
+       the 3.x series is likely (but not yet decided). Backporting to 2.7 is
+       less likely but not ruled out.
+
+    Q: Is this supposed to replace numpy?
+
+    A: No. While it is likely to grow over the years (see open issues below)
+       it does not aim to replace, or even compete directly with, numpy. Numpy
+       is a full-featured numeric library aimed at professionals, the nuclear
+       reactor of numeric libraries in the Python ecosystem. This is just a
+       battery, as in "batteries included", and is aimed at an intermediate
+       level somewhere between "use numpy" and "roll your own version".
+
+
+Open and Deferred Issues
+
+    - At this stage, I am unsure of the best API for multivariate statistical
+      functions such as linear regression, correlation coefficient, and
+      covariance. Possible APIs include:
+
+        * Separate arguments for x and y data:
+          function([x0, x1, ...], [y0, y1, ...])
+
+        * A single argument for (x, y) data:
+          function([(x0, y0), (x1, y1), ...])
+
+        * Selecting arbitrary columns from a 2D array:
+          function([[a0, x0, y0, z0], [a1, x1, y1, z1], ...], x=1, y=2)
+
+        * Some combination of the above.
+
+      In the absence of a consensus on the preferred API for multivariate
+      stats, I will defer including such multivariate functions until
+      Python 3.5.
+
+    - Likewise, functions for calculating probability of random variables and
+      inference testing (e.g. Student's t-test) will be deferred until 3.5.
+
+    - There is considerable interest in including one-pass functions that can
+      calculate multiple statistics from data in iterator form, without having
+      to convert to a list. The experimental "stats" package on PyPI includes
+      co-routine versions of statistics functions. Including these will be
+      deferred to 3.5.
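+
+    To make the first three candidate call shapes for multivariate
+    functions concrete, here is a hedged sketch using sample covariance
+    as the example statistic.  None of these functions exist in the
+    proposal; covariance, covariance_pairs and covariance_columns are
+    hypothetical names chosen for illustration:
+
```python
import math


def covariance(xs, ys):
    # Shape 1: separate arguments for x and y data (sample covariance).
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences of 2+ points")
    xbar = math.fsum(xs)/n
    ybar = math.fsum(ys)/n
    return math.fsum((x - xbar)*(y - ybar) for x, y in zip(xs, ys))/(n - 1)


def covariance_pairs(pairs):
    # Shape 2: a single argument holding (x, y) tuples.
    xs, ys = zip(*pairs)
    return covariance(xs, ys)


def covariance_columns(rows, x, y):
    # Shape 3: pick two columns, by index, out of a 2D table.
    return covariance([row[x] for row in rows], [row[y] for row in rows])


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
rows = [[0, a, b, 0] for a, b in zip(xs, ys)]

# All three calls describe the same data and give the same answer.
print(covariance(xs, ys))
print(covariance_pairs(list(zip(xs, ys))))
print(covariance_columns(rows, x=1, y=2))
```
+
+    The second and third shapes reduce to the first with a one-line
+    adapter, which suggests the choice is mostly about caller
+    convenience rather than about the underlying calculation.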
+
+
+References
+
+    [1] http://mail.python.org/pipermail/python-dev/2010-October/104721.html
+
+    [2] http://support.casio.com/pdf/004/CP330PLUSver310_Soft_E.pdf
+
+    [3] Gnumeric:
+            https://projects.gnome.org/gnumeric/functions.shtml
+
+        LibreOffice:
+            https://help.libreoffice.org/Calc/Statistical_Functions_Part_One
+            https://help.libreoffice.org/Calc/Statistical_Functions_Part_Two
+            https://help.libreoffice.org/Calc/Statistical_Functions_Part_Three
+            https://help.libreoffice.org/Calc/Statistical_Functions_Part_Four
+            https://help.libreoffice.org/Calc/Statistical_Functions_Part_Five
+
+    [4] Scipy: http://scipy-central.org/
+        Numpy: http://www.numpy.org/
+
+    [5] http://wiki.scipy.org/Numpy_Functions_by_Category
+
+    [6] Tested with numpy 1.6.1 and Python 2.7.
+
+    [7] http://www.johndcook.com/blog/2008/09/26/comparing-three-methods-of-computing-standard-deviation/
+
+    [8] http://rosettacode.org/wiki/Standard_deviation
+
+    [9] https://bitbucket.org/larsyencken/simplestats/src/c42e048a6625/src/basic.py
+
+    [10] http://stackoverflow.com/questions/2341340/calculate-mean-and-variance-with-one-iteration
+
+    [11] http://www.r-project.org/
+
+    [12] http://msdn.microsoft.com/en-us/library/system.linq.enumerable.average.aspx
+
+    [13] https://www.bcg.wisc.edu/webteam/support/ruby/standard_deviation
+
+    [14] http://ruby-statsample.rubyforge.org/
+
+    [15] http://www.php.net/manual/en/ref.stats.php
+
+    [16] http://www.ayton.id.au/gary/it/Delphi/D_maths.htm#Delphi%20Statistical%20functions.
+
+    [17] http://www.gnu.org/software/gsl/manual/html_node/Statistics.html
+
+    [18] http://www.gnu.org/software/gsl/manual/html_node/Mean-and-standard-deviation-and-variance.html
+
+    [19] http://mathworld.wolfram.com/Skewness.html
+
+    [20] At least, tedious to those who don't like this sort of thing.
+
+    [21] http://mail.python.org/pipermail/python-ideas/2011-September/011524.html
+
+    [22] https://pypi.python.org/pypi/stats/
+
+    [23] http://mail.python.org/pipermail/python-ideas/2013-August/022630.html
+
+
+Copyright
+
+    This document has been placed in the public domain.
+
+
+
+Local Variables:
+mode: indented-text
+indent-tabs-mode: nil
+sentence-end-double-space: t
+fill-column: 70
+coding: utf-8
+End:

-- 
Repository URL: http://hg.python.org/peps

