[Python-ideas] Pre-PEP: adding a statistics module to Python

Tue Aug 6 04:10:37 CEST 2013

On 05/08/13 12:59, Stephen J. Turnbull wrote:
> I couldn't find a list of functions proposed for inclusion in the
> statistics package in the pre-PEP, only lists of functions in other
> implementations that "suggest" the content of this package.  Did I
> miss something?

Not really. I haven't seen the full public API of modules listed in other PEPs, so I didn't include it in mine. Perhaps I didn't look hard enough.

Here's the current public API:

- add_partial           Utility for performing high-precision sums.
- mean                  Arithmetic mean (average) of data.
- median                Median (middle value) of data.
- median.high           Median, taking the high value in ties.
- median.low            Median, taking the low value in ties.
- median.grouped        Median, adjusting for grouped data.
- mode                  Mode (most common value) of data.
- mode.collate          Helper for mode.
- mode.extract          Helper for mode.
- pstdev                Population standard deviation of data.
- pvariance             Population variance of data.
- StatisticsError       Exception for statistics errors.
- stdev                 Sample standard deviation of data.
- sum                   High-precision sum of data.
- variance              Sample variance of data.

After discussion with Oscar, I am leaning towards changing the API for mode, so mode.collate and mode.extract may not survive.

[...]
> And some of your
> arguments are basically incorrect when considered from the standpoint
> of *interpreting*, rather than *computing*, statistics:
>
> Steven D'Aprano writes:
>
>   >     - The built-in sum can lose accuracy when dealing with floats of wildly
>   >       differing magnitude.  Consequently, the above naive mean fails this
>   >       "torture test" with an error of 100%:
>   >
>   >           assert mean([1e30, 1, 3, -1e30]) == 1
>
> 100%?  This is a relative error of sqrt(2)*1e-30.

I don't understand your calculation here. Where are you getting the values 2 and 1e-30 from? The arithmetic mean of the four values given is exactly 1 (a total of 4, divided by 4). The calculated value is 0, which is an absolute error of 1, or a relative error of (1-0)/1 = 100%.
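
To make the arithmetic concrete, here is a minimal sketch of the torture test. It assumes "naive" means plain left-to-right float addition, which I write out as an explicit loop so the result does not depend on how any particular Python version implements the built-in sum. The helper name naive_sum is purely illustrative and is not part of the proposed API; math.fsum is the existing standard library function.

```python
import math

data = [1e30, 1, 3, -1e30]

def naive_sum(values):
    """Plain left-to-right float addition (illustrative helper only)."""
    total = 0.0
    for x in values:
        total += x  # 1 and 3 are absorbed by 1e30 and lost
    return total

naive_mean = naive_sum(data) / len(data)
accurate_mean = math.fsum(data) / len(data)  # fsum tracks the lost low-order bits

# Relative error of the naive result, measured against the accurate mean.
relative_error = abs(accurate_mean - naive_mean) / abs(accurate_mean)

print(naive_mean, accurate_mean, relative_error)  # 0.0 1.0 1.0
```

The naive result is 0.0 where the true mean is 1.0, hence the relative error of 1.0, i.e. 100%.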

[...]
> So I would throw out all these appealing arguments that depend on
> confounding numerical accuracy and statistical accuracy, and replace
> it with a correct argument showing how precision does matter in
> statistical interpretation:
>
>      The first step in interpreting variation in data (including
>      dealing with ill-conditioned data) is standardization of the data
>      to a series with variance 1 (and often, mean 0).  Standardization
>      requires accurate computation of the mean and standard deviation of
>      the raw series.  However, naive computation of mean and standard
>      deviation can lose precision very quickly.  Because precision
>      bounds accuracy, it is important to use the most precise possible
>      algorithms for computing mean and standard deviation, or the
>      results of standardization are themselves useless.

Thanks for the contribution.
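
For what it's worth, the precision loss described there is easy to reproduce. The following is only a sketch under my own assumptions: I take the "naive" computation to be the textbook one-pass formula (sum of squares minus n times the squared mean), and compare it with a two-pass calculation built on math.fsum. Both function names are hypothetical and neither is the implementation proposed for the module.

```python
import math

# Four values with a large common offset; the exact sample variance is 30.
data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]

def naive_variance(values):
    """One-pass textbook formula; prone to catastrophic cancellation."""
    n = len(values)
    total = 0.0
    total_sq = 0.0
    for x in values:
        total += x
        total_sq += x * x  # squares are ~1e18, so low-order digits are rounded away
    mean = total / n
    return (total_sq - n * mean * mean) / (n - 1)

def two_pass_variance(values):
    """Compute the mean first, then sum the squared deviations from it."""
    n = len(values)
    mean = math.fsum(values) / n
    return math.fsum((x - mean) ** 2 for x in values) / (n - 1)

print(naive_variance(data))     # far from 30
print(two_pass_variance(data))  # 30.0
```

Standardizing this data with the naive figure would give nonsense, since the subtraction of two nearly equal numbers of size 4e18 wipes out the information the variance depends on.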

[...]
> I also wonder about the utility of a "statistics" package that has no
> functionality for presenting and operating on the most fundamental
> "statistic" of all: the (empirical) distribution.  Eg my own
> statistics package will *never* suffer from ill-conditioned data (it's
> only used for dealing with generated series of 10-100 data points with
> a maximum dynamic range of about 100 to 1), but it's important for my
> purposes to be able to flexibly deal with distributions (computing
> modes and arbitrary percentiles, "bootstrap" random functions,
> recognize multimodality, generate histograms, etc).  That's only an
> example, specific to teaching (and I use spreadsheets and R, not
> Python, for demonstrations of actual computational applications).

It's early days, and it is better to start the module small and grow it than to try to fit everything and the kitchen sink in from Day One.

> I think the wide variety of applications of distributions merits
> consideration of their inclusion in a "batteries included" statistical
> package.

I'm happy to discuss this further with you off-list.

-- 
Steven