[Python-ideas] statistics module in Python3.4

Wolfgang Maier wolfgang.maier at biologie.uni-freiburg.de
Sat Feb 1 21:47:14 CET 2014


Oscar Benjamin <oscar.j.benjamin at ...> writes:

Hi Oscar,
and thanks for this very detailed post.

> 
> You're making this sound a lot more complicated than it is. The
> problem is simple: Decimal doesn't integrate with the numeric tower.
> This is explicit in the PEP that brought in the numeric tower:
> http://www.python.org/dev/peps/pep-3141/#the-decimal-type
> 

You're perfectly right about this as far as built-in number types and the
standard library types Fraction and Decimal are concerned.


> That being said I think that guaranteeing an error is
> better than the current order-dependent behaviour (and agree that that
> should be considered a bug).
> 

For custom types, the type returned by _sum can also be order-dependent,
because of this part of _coerce_types:

def _coerce_types(T1, T2):
[..]
    if issubclass(T2, float): return T2
    if issubclass(T1, float): return T1
    # Subclasses of the same base class give priority to the second.
    if T1.__base__ is T2.__base__: return T2

I chose the more drastic example with Fraction and Decimal for my initial
post because there the difference is between a result and an error, but the
above may illustrate better why I said that the returned type of _sum is
hard to predict.
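
For instance, with two hypothetical float subclasses (a self-contained
sketch that just reproduces the quoted branch), whichever type is passed
second wins:

class Celsius(float): pass
class Fahrenheit(float): pass

def _coerce_types(T1, T2):
    # only the branch quoted above, for illustration
    if issubclass(T2, float): return T2
    if issubclass(T1, float): return T1

_coerce_types(Celsius, Fahrenheit)   # -> Fahrenheit
_coerce_types(Fahrenheit, Celsius)   # -> Celsius

So with mixed Celsius/Fahrenheit data, the type reported by _sum depends on
the order in which the values happen to appear.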


> If there is to be a more drastic rearrangement of the _sum function
> then it should actually be to solve the problem that the current
> implementation of mean, variance etc. uses Fractions for all the heavy
> lifting but then rounds in the wrong place (when returning from
> _sum()) rather than in the mean, variance function itself.
> 

This is an excellent remark and I absolutely agree with your point here.
It's one of the aspects of the statistics module that I pondered over for
weeks.
Essentially, the fact that all current functions relying on _sum round
imprecisely anyway was my motivation for suggesting the simple:

def _coerce_types(types):
    # with a single input type, preserve it; otherwise fall back to float
    if len(types) == 1:
        return next(iter(types))
    return float

because it certainly makes sense to return the type found in the input if
there is only one, but when the types are mixed, why go to the trouble of
guessing when it does not help precision anyway? However, I realized that I
probably rushed this, because the implementation of the functions that call
_sum may later change to rely on an exact return value.
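
Under that simplified rule (assuming _sum collects the set of all types
seen in the input), the behaviour would simply be, for example:

from decimal import Decimal

_coerce_types({Decimal})        # -> Decimal: single input type is preserved
_coerce_types({int, Decimal})   # -> float: mixed types fall back to float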


> The clever algorithm in the variance function (unless it changed since
> I last looked) is entirely unnecessary when all of the intensive
> computation is performed with exact arithmetic. In the absence of
> rounding error you could compute a perfectly good variance using the
> computational formula for variance in a single pass. Similarly
> although the _sum() function is correctly rounded, the mean() function
> calls _sum() and then rounds again so that the return value from
> mean() is rounded twice. _sum() computes an exact value as a fraction
> and then coerces it with
> 
>     return T(total_numerator) / total_denominator
> 
> so that the division causes it to be correctly rounded. However the
> mean function effectively ends up doing
> 
>      return (T(total_numerator) / total_denominator) / num_items
> 
> which uses 2 divisions and hence rounds twice. It's trivial to
> rearrange that so that you round once
> 
>     return T(total_numerator) / (total_denominator * num_items)
> 
> except that to do this the _sum function should be changed to return
> the exact result as a Fraction (and perhaps the type T). Similar
> changes would need to be made to the sum of squares function (_ss()
> IIRC). The double rounding in mean() isn't a big deal but the
> corresponding effect for the variance functions is significant. It was
> after realising this that the sum function was renamed _sum and made
> nominally private.
> 
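
To make the double rounding concrete, here is a small sketch with
arbitrarily chosen numbers (the Fraction stands in for the exact total
computed by _sum, and the low Decimal precision just makes the effect
visible):

from decimal import Decimal, getcontext
from fractions import Fraction

getcontext().prec = 3      # low precision so the difference shows up

total = Fraction(2, 3)     # stand-in for the exact sum from _sum()
n = 7                      # number of items

# current behaviour: two divisions, hence two roundings
twice = (Decimal(total.numerator) / total.denominator) / n

# rearranged: a single division, hence a single rounding
once = Decimal(total.numerator) / (total.denominator * n)

print(twice, once)         # 0.0953 0.0952 -- the last digit differs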

I have been thinking about this solution as well, but I think you really
have to return a tuple of the sum as a Fraction and the type (not just
"perhaps"), since it would be really weird if the public functions in
statistics always returned a Fraction even when the input sequence consisted
of only one standard type like int, float or Decimal. The obvious criticism
then is that such a _sum is not really a sum function anymore, unlike the
existing ones. Then again, since this is a module-private function, it may
be ok to do this?
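
A rough sketch of what I have in mind (heavily simplified: the start value,
error handling and the partial-sums bookkeeping of the real _sum are left
out, and the coercion rule here is a crude placeholder just so the sketch
runs):

from fractions import Fraction

def _coerce_types(T1, T2):
    # crude placeholder coercion rule for this sketch only
    if T1 is T2 or T1 is int:
        return T2
    if T2 is int:
        return T1
    return float

def _sum(data):
    # exact accumulation; return the coerced type together with the exact
    # total as a Fraction, instead of rounding here
    T = int
    total = Fraction(0)
    for x in data:
        T = _coerce_types(T, type(x))
        total += Fraction(x)
    return T, total

def mean(data):
    data = list(data)
    T, total = _sum(data)
    # a single division, so the result is rounded only once
    return T(total.numerator) / (total.denominator * len(data))

print(mean([1.25, 2.5, 3.75]))   # 2.5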

Best,
Wolfgang


