Compact Python library for math statistics

Raymond Hettinger python at rcn.com
Fri Apr 9 04:24:58 EDT 2004


> A statistics module will
> be nice to have, although it is easy to write your own.
> 
> Here is a minor suggestion. The functions 'mean' and 'variance' are
> separate, and the latter function requires a mean to be calculated. To
> save CPU time, it would be nice to have a single function that returns
> both the mean and variance, or a function to compute the variance with
> a known mean.

Like you said, that is easy enough to write on your own.  This
lightweight module is not meant to replace heavy-weights that already
exist outside of the core distribution.

The goals are to have a simple set of functions for daily use and for
these data reduction functions to work as well as possible with
generator expressions (one pass over the data wherever possible).
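
For anyone who wants the combined function, here is a minimal sketch of a
one-pass mean and variance that accepts any iterable, including a generator
expression. It uses Welford's running-update method, which is my choice for
illustration and not necessarily what the module does. The name
mean_variance and the default value of the sample flag are assumptions.

```python
def mean_variance(data, sample=True):
    """Return (mean, variance) using a single pass over any iterable.

    Hypothetical helper, not part of the module under discussion.
    Uses Welford's running-update method.
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in data:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    if n == 0:
        raise ValueError("mean_variance() requires at least one data point")
    if sample:
        if n < 2:
            raise ValueError("sample variance requires at least two data points")
        return mean, m2 / (n - 1)
    return mean, m2 / n


# Works with a generator expression, so the data is never stored.
m, v = mean_variance(float(i) for i in range(1, 5))
```

Because the data is consumed exactly once, the same call works on a file
iterator or any other stream that cannot be rewound.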



> (1) In computing the median, there is a line of code
> 
>     return (select(data, n//2) + select(data, n//2-1)) / 2
> 
> I think finding the 500th and 501st elements separately out of a 1000
> element array is inefficient. Isn't there a way to get consecutive
> ordered elements in about the same time needed to get a single
> element?

Select uses an O(n) algorithm, so the penalty is not that much. 
Making it accommodate selecting a range would greatly complicate and
slow down the code.  If you need the low, the high, and percentiles,
then it may be better to just sort the data.
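
To illustrate the trade-off, here is a rough sketch of the two approaches.
The select below is a plain randomized quickselect written for this
example (expected O(n) time); it is an assumption about the general
technique and not the module's actual code. median_by_sort is a
hypothetical name for the sort-based alternative.

```python
import random


def select(data, k):
    """Return the k-th smallest element (0-based) in expected O(n) time.

    Illustrative quickselect, not the module's implementation.
    """
    data = list(data)
    if not 0 <= k < len(data):
        raise IndexError("k is out of range")
    while True:
        pivot = random.choice(data)
        lows = [x for x in data if x < pivot]
        highs = [x for x in data if x > pivot]
        n_equal = len(data) - len(lows) - len(highs)
        if k < len(lows):
            data = lows                  # answer is among the smaller items
        elif k < len(lows) + n_equal:
            return pivot                 # answer equals the pivot
        else:
            k -= len(lows) + n_equal     # skip past lows and pivot copies
            data = highs


def median_by_sort(data):
    """Median via one O(n log n) sort; neighbours come for free."""
    s = sorted(data)
    n = len(s)
    if n == 0:
        raise ValueError("median_by_sort() requires at least one data point")
    mid = n // 2
    if n % 2:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2


x = [1.0, 2.0, 3.0, 4.0]
n = len(x)
two_selects = (select(x, n // 2) + select(x, n // 2 - 1)) / 2
one_sort = median_by_sort(x)
```

Once the data is sorted, the minimum, maximum, and any percentile are just
index lookups, which is why sorting tends to win when several order
statistics are needed.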



> (2) The following code crashes when median(x) is computed. Why?
> 
> from statistics import mean,median
> x = [1.0,2.0,3.0,4.0]
> print mean(x)
> print median(x)

Hmm, it works for me.  What does your traceback look like?



> (3) The standard deviation is computed as 
> 
>     return variance(data, sample) ** 0.5
> 
> I think the sqrt function should be used instead -- this may be
> implemented more efficiently than general exponentiation.

The timings show otherwise:

C:\pydev>python timeit.py -r9 -n100000 -s "import math; sqrt=math.sqrt" "sqrt(7.0)"
100000 loops, best of 9: 1.7 usec per loop

C:\pydev>python timeit.py -r9 -n100000 -s "7.0 ** 0.5"
100000 loops, best of 9: 0.237 usec per loop
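
For anyone who wants to repeat the measurement, here is a hedged sketch
using the timeit module from inside Python. One caveat on the commands
above: as the second one is written, the expression is passed through -s,
which timeit treats as setup code, so the timed statement falls back to the
default "pass" and the two figures may not be directly comparable. The
sketch times both expressions as statements and uses a variable instead of
a literal, since some interpreters may fold a constant expression such as
7.0 ** 0.5 at compile time. The function name compare_sqrt_and_pow is an
assumption, and no particular outcome is claimed.

```python
import timeit


def compare_sqrt_and_pow(number=100000, repeat=9):
    """Return (sqrt_seconds, pow_seconds), each the best of `repeat` runs.

    Hypothetical helper for illustration; results depend on the machine
    and the Python version.
    """
    sqrt_time = min(timeit.repeat(
        "sqrt(x)",
        setup="from math import sqrt; x = 7.0",
        number=number,
        repeat=repeat,
    ))
    pow_time = min(timeit.repeat(
        "x ** 0.5",
        setup="x = 7.0",
        number=number,
        repeat=repeat,
    ))
    return sqrt_time, pow_time


sqrt_time, pow_time = compare_sqrt_and_pow(number=10000, repeat=3)
print("sqrt(x):  %.3f usec per loop" % (sqrt_time / 10000 * 1e6))
print("x ** 0.5: %.3f usec per loop" % (pow_time / 10000 * 1e6))
```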



Raymond Hettinger


