[Numpy-discussion] please change mean to use dtype=float

David M. Cooke cookedm at physics.mcmaster.ca
Thu Sep 21 17:59:37 EDT 2006


On Thu, 21 Sep 2006 11:34:42 -0700
Tim Hochberg <tim.hochberg at ieee.org> wrote:

> Tim Hochberg wrote:
> > Robert Kern wrote:
> >   
> >> David M. Cooke wrote:
> >>   
> >>     
> >>> On Wed, Sep 20, 2006 at 03:01:18AM -0500, Robert Kern wrote:
> >>>     
> >>>       
> >>>> Let me offer a third path: the algorithms used for .mean() and .var()
> >>>> are substandard. There are much better incremental algorithms that
> >>>> entirely avoid the need to accumulate such large (and therefore
> >>>> precision-losing) intermediate values. The algorithms look like the
> >>>> following for 1D arrays in Python:
> >>>>
> >>>> def mean(a):
> >>>>      m = a[0]
> >>>>      for i in range(1, len(a)):
> >>>>          m += (a[i] - m) / (i + 1)
> >>>>      return m
> >>>>       
> >>>>         
> >>> This isn't really going to be any better than using a simple sum.
> >>> It'll also be slower (a division per iteration).
> >>>     
> >>>       
> >> With one exception, every test that I've thrown at it shows that it's
> >> better for float32. That exception is uniformly spaced arrays, like
> >> linspace().
> >>
> >>  > You do avoid
> >>  > accumulating large sums, but then doing the division a[i]/len(a) and
> >>  > adding that will do the same.
> >>
> >> Okay, this is true.
> >>
> >>   
> >>     
> >>> Now, if you want to avoid losing precision, you want to use a better
> >>> summation technique, like compensated (or Kahan) summation:
> >>>
> >>> def mean(a):
> >>>     s = e = a.dtype.type(0)
> >>>     for i in range(0, len(a)):
> >>>         temp = s
> >>>         y = a[i] + e
> >>>         s = temp + y
> >>>         e = (temp - s) + y
> >>>     return s / len(a)
> >>     
> >>>> def var(a):
> >>>>      m = a[0]
> >>>>      t = a.dtype.type(0)
> >>>>      for i in range(1, len(a)):
> >>>>          q = a[i] - m
> >>>>          r = q / (i+1)
> >>>>          m += r
> >>>>          t += i * q * r
> >>>>      t /= len(a)
> >>>>      return t
> >>>>
> >>>> Alternatively, from Knuth:
> >>>>
> >>>> def var_knuth(a):
> >>>>      m = a.dtype.type(0)
> >>>>      variance = a.dtype.type(0)
> >>>>      for i in range(len(a)):
> >>>>          delta = a[i] - m
> >>>>          m += delta / (i+1)
> >>>>          variance += delta * (a[i] - m)
> >>>>      variance /= len(a)
> >>>>      return variance
> 
> I'm going to go ahead and attach a module containing the versions of 
> mean, var, etc that I've been playing with in case someone wants to mess 
> with them. Some were stolen from traffic on this list, for others I 
> grabbed the algorithms from wikipedia or equivalent.

I looked into this a bit more. I checked float32 (single precision) and
float64 (double precision), using long doubles (float96) for the "exact"
results. This is based on your code. Results are compared using
abs(exact_stat - computed_stat) / max(abs(values)), with 10000 values in the
range [-100, 900].
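
(In case it's useful, here's a minimal sketch of that comparison setup; the
real script is at the URL further down, so this is only illustrative, and the
uniform random draw is an assumption.)

import numpy as np

values = np.random.uniform(-100, 900, 10000)   # 10000 values in [-100, 900]
exact = values.astype(np.longdouble)           # float96 reference, the "exact" result
a32 = values.astype(np.float32)                # single-precision test data

exact_mean = exact.mean()
computed_mean = a32.mean()                     # or whichever algorithm is under test
error = abs(float(exact_mean) - float(computed_mean)) / np.abs(values).max()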

First, the mean. In float32, Kahan summation in single precision is about 2
orders of magnitude better than simple summation. However, accumulating the
sum in double precision is about 9 orders of magnitude better than simple
summation (7 orders better than Kahan).
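
(Concretely, the two float32 alternatives look something like this; just a
sketch with made-up data.)

import numpy as np

a = np.random.uniform(-100, 900, 10000).astype(np.float32)

m_simple = a.sum() / len(a)                   # simple summation, float32 accumulator
m_double = a.sum(dtype=np.float64) / len(a)   # sum accumulated in double precision
m_double2 = a.mean(dtype=np.float64)          # same thing via the dtype argument to mean()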

In float64, Kahan summation is the way to go, by 2 orders of magnitude.

For the variance, in float32, Knuth's method is *no better* than the two-pass
method. Tim's code does an implicit conversion of intermediate results to
float64, which is why he saw a much better result. The two-pass method using
Kahan summation (again, in single precision) is better by about 2 orders of
magnitude. With a double-precision accumulator there is practically no
difference among the techniques: they're all about 9 orders of magnitude
better than the single-precision two-pass method.
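
(Spelled out, the single-precision two-pass + Kahan variance looks roughly
like the following sketch; it reuses the compensated-summation loop quoted
above, and the actual code I ran is at the URL below.)

import numpy as np

def kahan_sum(a):
    # compensated (Kahan) summation, accumulating in a's own precision
    s = e = a.dtype.type(0)
    for x in a:
        temp = s
        y = x + e
        s = temp + y
        e = (temp - s) + y
    return s

def var_two_pass_kahan(a):
    # first pass: the mean; second pass: mean of squared deviations,
    # with both sums compensated and kept in single precision throughout
    n = a.dtype.type(len(a))
    m = kahan_sum(a) / n
    return kahan_sum((a - m) ** 2) / n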

In float64, Kahan summation is again better than the rest, by about 2 orders
of magnitude.

I've put my adaptation of Tim's code, and box-and-whisker plots of the
results, at http://arbutus.mcmaster.ca/dmc/numpy/variance/

Conclusions:

- If you're going to calculate everything in single precision, use Kahan
summation. Using it in double precision also helps.
- If you can use a double-precision accumulator, it's much better than any of
the techniques in single precision only.
- For speed+precision in the variance, either use Kahan summation in single
precision with the two-pass method, or use double precision with simple
summation with the two-pass method (a sketch of the latter is below). Knuth
buys you nothing, except slower code :-)
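
(A sketch of that second option, the two-pass method with simple summation in
a double-precision accumulator; again illustrative rather than the exact code
at the URL above.)

import numpy as np

def var_two_pass_double(a):
    # two-pass variance on float32 data, with the sums (and the deviations)
    # carried in double precision
    n = len(a)
    m = a.sum(dtype=np.float64) / n
    d = a.astype(np.float64) - m
    return (d * d).sum() / n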

After 1.0 is out, we should look at doing one of the above.

-- 
|>|\/|<
/--------------------------------------------------------------------------\
|David M. Cooke                      http://arbutus.physics.mcmaster.ca/dmc/
|cookedm at physics.mcmaster.ca



