[Python-ideas] Pre-PEP: adding a statistics module to Python

Stephen J. Turnbull stephen at xemacs.org
Mon Aug 5 04:59:06 CEST 2013


I couldn't find a list of functions proposed for inclusion in the
statistics package in the pre-PEP, only lists of functions in other
implementations that "suggest" the content of this package.  Did I
miss something?

I can't agree with your rationale for inclusion based on the
imprecision of the built-in sum.  (That doesn't mean I'm opposed to
inclusion, but it does cause me to raise some questions below.)
Correcting the numerical instability issues you describe doesn't
improve statistical accuracy in any application I'm aware of.
Rather, you're effectively assuming that data values are given with
infinite precision and infinite accuracy.  Are there any applications
of statistics where that assumption makes sense?  And some of your
arguments are basically incorrect when considered from the standpoint
of *interpreting*, rather than *computing*, statistics:

Steven D'Aprano writes:

 >     - The built-in sum can lose accuracy when dealing with floats of wildly
 >       differing magnitude.  Consequently, the above naive mean fails this
 >       "torture test" with an error of 100%:
 > 
 >           assert mean([1e30, 1, 3, -1e30]) == 1

100%?  Measured in the natural unit here, the standard deviation of
the data, the error is about sqrt(2)*1e-30.  The mean is simply not an
appropriate choice of unit in statistics, especially not when the mean
itself is zero to 30 decimal places when expressed in standard
deviation units.
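
To make that arithmetic concrete, here is a quick sketch with plain
floats (none of this needs the proposed module):

    data = [1e30, 1, 3, -1e30]

    naive_mean = sum(data) / len(data)   # built-in sum loses the 1 and the 3
    true_mean = 1.0

    # population standard deviation of the data (the 4 is lost here too,
    # but it is negligible at this scale)
    sd = (sum((x - true_mean) ** 2 for x in data) / len(data)) ** 0.5

    print(naive_mean)                         # 0.0 rather than 1.0
    print(abs(naive_mean - true_mean) / sd)   # ~1.4e-30 standard deviations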

 >     - Using math.fsum inside mean will make it more accurate with
 >       float data,

Not necessarily.  It will be more *numerically* precise, which makes
it more *statistically* accurate only if statistical accuracy is the
same thing as numerical precision; in most statistical applications
that is nowhere near the case.
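
To be clear, the fsum-based mean does pass the torture test; a quick
check (stdlib only):

    import math

    data = [1e30, 1, 3, -1e30]
    assert math.fsum(data) / len(data) == 1.0   # numerically exact this time

But that exactness is exactness of arithmetic on the recorded values;
it says nothing about whether those values deserve sixteen significant
digits in the first place.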

My point throughout is that if high-precision calculation matters in
statistics, you've got more fundamental problems in your data than
precision of calculation can address.  Garbage in, garbage out applies
no matter how good the algorithms are.

So I would throw out all these appealing arguments that depend on
conflating numerical precision with statistical accuracy, and replace
them with a correct argument showing where precision really does
matter in statistical interpretation:

    The first step in interpreting variation in data (including
    dealing with ill-conditioned data) is standardization of the data
    to a series with variance 1 (and often, mean 0).  Standardization
    requires accurate computation of the mean and standard deviation of
    the raw series.  However, naive computation of mean and standard
    deviation can lose precision very quickly.  Because precision
    bounds accuracy, it is important to use the most precise possible
    algorithms for computing mean and standard deviation, or the
    results of standardization are themselves useless.

This (in combination with your examples) makes it clear why having
such functions in Python makes sense.
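
For concreteness, the standardization step is nothing more than the
following sketch (stdlib only; the function name is mine, not a
proposal for the module's API):

    import math

    def standardize(data):
        """Rescale data to mean 0 and (population) variance 1."""
        n = len(data)
        mu = math.fsum(data) / n
        # two-pass variance: compute the mean first, then the squared
        # deviations from it
        sigma = math.sqrt(math.fsum((x - mu) ** 2 for x in data) / n)
        return [(x - mu) / sigma for x in data]

Any error in mu and sigma propagates into every standardized value,
which is why the precision of those two computations matters here.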

However, it remains unclear to me that other statistical functions are
really needed.  Without having actually thought about it[1], I'm
inclined to think that replacing math.fsum with the proposed
statistics.sum, adding mean and standard_deviation functions to math,
and moving the existing math.fsum to math.faster_sum would be
sufficient to address
the real needs here.  (Of course, math.faster_sum should be documented
to be avoided in applications where ill-conditioned data might arise
-- this includes any case, such as computing variance, where a series
is generated as the difference of two series with similar means.)
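
The variance case in that parenthesis is the classic example; a small
demonstration of the difference (function names are mine, not
proposals):

    import math

    def variance_textbook(data):
        # E[X^2] - (E[X])^2: the two terms nearly cancel when the mean
        # is large relative to the spread
        n = len(data)
        return math.fsum(x * x for x in data) / n - (math.fsum(data) / n) ** 2

    def variance_two_pass(data):
        n = len(data)
        mu = math.fsum(data) / n
        return math.fsum((x - mu) ** 2 for x in data) / n

    data = [1e9 + x for x in (4.0, 7.0, 13.0, 16.0)]
    print(variance_two_pass(data))   # 22.5, the exact answer
    print(variance_textbook(data))   # nowhere near 22.5; the cancellation
                                     # destroys all the information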

I also wonder about the utility of a "statistics" package that has no
functionality for presenting and operating on the most fundamental
"statistic" of all: the (empirical) distribution.  Eg my own
statistics package will *never* suffer from ill-conditioned data (it's
only used for dealing with generated series of 10-100 data points with
a maximum dynamic range of about 100 to 1), but it's important for my
purposes to be able to flexibly deal with distributions (computing
modes and arbitrary percentiles, "bootstrap" random functions,
recognize multimodality, generate histograms, etc).  That's only an
example, specific to teaching (and I use spreadsheets and R, not
Python, for demonstrations of actual computational applications).
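
To be concrete about "dealing with distributions", the operations I
have in mind look roughly like this (hypothetical names and
conventions, purely illustrative):

    import random

    def percentile(data, p):
        """Empirical p-th percentile by linear interpolation (one of
        several common conventions)."""
        s = sorted(data)
        k = (len(s) - 1) * p / 100.0
        lo = int(k)
        hi = min(lo + 1, len(s) - 1)
        return s[lo] + (s[hi] - s[lo]) * (k - lo)

    def bootstrap_means(data, reps=1000):
        """Means of reps resamples drawn with replacement from data."""
        n = len(data)
        return [sum(random.choice(data) for _ in range(n)) / n
                for _ in range(reps)]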

I think the wide variety of applications of distributions merits
consideration of their inclusion in a "batteries included" statistical
package.


Footnotes: 
[1]  Because the PEP doesn't specify a list of functions.

