[Numpy-discussion] please change mean to use dtype=float
Tim Hochberg
tim.hochberg at ieee.org
Thu Sep 21 13:56:38 EDT 2006
Robert Kern wrote:
> David M. Cooke wrote:
>
>> On Wed, Sep 20, 2006 at 03:01:18AM -0500, Robert Kern wrote:
>>
>>> Let me offer a third path: the algorithms used for .mean() and .var() are
>>> substandard. There are much better incremental algorithms that entirely avoid
>>> the need to accumulate such large (and therefore precision-losing) intermediate
>>> values. The algorithms look like the following for 1D arrays in Python:
>>>
>>> def mean(a):
>>> m = a[0]
>>> for i in range(1, len(a)):
>>> m += (a[i] - m) / (i + 1)
>>> return m
>>>
>> This isn't really going to be any better than using a simple sum.
>> It'll also be slower (a division per iteration).
>>
>
> With one exception, every test that I've thrown at it shows that it's better for
> float32. That exception is uniformly spaced arrays, like linspace().
>
> > You do avoid
> > accumulating large sums, but then doing the division a[i]/len(a) and
> > adding that will do the same.
>
> Okay, this is true.
>
>
>> Now, if you want to avoid losing precision, you want to use a better
>> summation technique, like compensated (or Kahan) summation:
>>
>> def mean(a):
>> s = e = a.dtype.type(0)
>> for i in range(0, len(a)):
>> temp = s
>> y = a[i] + e
>> s = temp + y
>> e = (temp - s) + y
>> return s / len(a)
>>
>> Some numerical experiments in Maple using 5-digit precision show that
>> your mean is maybe a bit better in some cases, but can also be much
>> worse, than sum(a)/len(a), but both are quite poor in comparision to the
>> Kahan summation.
>>
>> (We could probably use a fast implementation of Kahan summation in
>> addition to a.sum())
>>
>
> +1
>
>
>>> def var(a):
>>> m = a[0]
>>> t = a.dtype.type(0)
>>> for i in range(1, len(a)):
>>> q = a[i] - m
>>> r = q / (i+1)
>>> m += r
>>> t += i * q * r
>>> t /= len(a)
>>> return t
>>>
>>> Alternatively, from Knuth:
>>>
>>> def var_knuth(a):
>>> m = a.dtype.type(0)
>>> variance = a.dtype.type(0)
>>> for i in range(len(a)):
>>> delta = a[i] - m
>>> m += delta / (i+1)
>>> variance += delta * (a[i] - m)
>>> variance /= len(a)
>>> return variance
>>>
>> These formulas are good when you can only do one pass over the data
>> (like in a calculator where you don't store all the data points), but
>> are slightly worse than doing two passes. Kahan summation would probably
>> also be good here too.
>>
>
> Again, my tests show otherwise for float32. I'll condense my ipython log into a
> module for everyone's perusal. It's possible that the Kahan summation of the
> squared residuals will work better than the current two-pass algorithm and the
> implementations I give above.
>
This is what my tests show as well var_knuth outperformed any simple two
pass algorithm I could come up with, even ones using Kahan sums.
Interestingly, for 1D arrays the built in float32 variance performs
better than it should. After a bit of twiddling around I discovered that
it actually does most of it's calculations in float64. It uses a two
pass calculation, the result of mean is a scalar, and in the process of
converting that back to an array we end up with float64 values. Or
something like that; I was mostly reverse engineering the sequence of
events from the results.
-tim
More information about the NumPy-Discussion
mailing list