[Python-ideas] NAN handling in the statistics module

Steven D'Aprano steve at pearwood.info
Mon Jan 7 02:05:26 EST 2019


On Sun, Jan 06, 2019 at 07:40:32PM -0800, Stephan Hoyer wrote:
> On Sun, Jan 6, 2019 at 4:27 PM Steven D'Aprano <steve at pearwood.info> wrote:
> 
> > I propose adding a "nan_policy" keyword-only parameter to the relevant
> > statistics functions (mean, median, variance etc), and defining the
> > following policies:
> >
> >     IGNORE:  quietly ignore all NANs
> >     FAIL:  raise an exception if any NAN is seen in the data
> >     PASS:  pass NANs through unchanged (the default)
> >     RETURN:  return a NAN if any NAN is seen in the data
> >     WARN:  ignore all NANs but raise a warning if one is seen
> >
> 
> I don't think PASS should be the default behavior, and I'm not sure it
> would be productive to actually implement all of these options.

I'm not wedded to the idea that the default ought to be the current 
behaviour. If there is a strong argument for one of the others, I'm 
listening.
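
To make the proposal concrete, here is a rough sketch of what such a
parameter might look like, written as a wrapper around the existing
mean() (the NanPolicy enum and all the names here are illustrative
only, not an implementation):

    import enum
    import math
    import statistics
    import warnings

    class NanPolicy(enum.Enum):
        IGNORE = "ignore"  # quietly drop NANs
        FAIL = "fail"      # raise an exception on the first NAN
        PASS = "pass"      # hand data through unchanged (current behaviour)
        RETURN = "return"  # return a NAN if any NAN is present
        WARN = "warn"      # drop NANs, but warn that some were seen

    def mean(data, *, nan_policy=NanPolicy.PASS):
        data = list(data)
        # Only float NANs are handled here; Decimal NANs would need
        # a similar test.
        nans = [x for x in data if isinstance(x, float) and math.isnan(x)]
        if nans and nan_policy is not NanPolicy.PASS:
            if nan_policy is NanPolicy.FAIL:
                raise statistics.StatisticsError("NAN in data")
            if nan_policy is NanPolicy.RETURN:
                return math.nan
            if nan_policy is NanPolicy.WARN:
                warnings.warn("ignoring %d NAN(s)" % len(nans))
            # IGNORE and WARN both drop the NANs before computing.
            data = [x for x in data
                    if not (isinstance(x, float) and math.isnan(x))]
        return statistics.mean(data)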


> For reference, NumPy and pandas (the two most popular packages for data
> analytics in Python) support two of these modes:
> - RETURN (numpy.mean() and skipna=False for pandas)
> - IGNORE (numpy.nanmean() and skipna=True for pandas)
> 
> RETURN is the default behavior for NumPy; IGNORE is the default for pandas.
> 
> I'm pretty sure RETURN is the right default behavior for Python's standard
> library and anything else should be considered a bug. It safely propagates
> NaNs, along the lines of IEEE float behavior.

How would you answer those who say that the right behaviour is not to 
propagate unwanted NANs, but to fail fast and raise an exception?
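
(For context: IEEE-754 arithmetic already propagates NANs on its own,
which is why RETURN comes nearly for free in a sum-based mean:

    >>> nan = float("nan")
    >>> sum([1.0, nan, 2.0])
    nan

whereas failing fast necessarily pays for an explicit check of the
data.)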


> I'm not sure what the use cases are for PASS, FAIL, or WARN, none of which
> are supported by NumPy or pandas:
> - PASS is a license to return silently incorrect results, in return for
> very marginal performance benefits.

By my (very rough) preliminary testing, checking for NANs doubles the 
cost of calculating median(), and increases the cost of mean() by 25%.

I'm not trying to compete with statistics libraries written in C for 
speed, but that doesn't mean I don't care about performance at all. The 
statistics library is already slower than I like, and I don't want to 
slow it down further in the common case (numeric data with no NANs) for 
the sake of the uncommon case (data with NANs).
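
For the curious, the check I have in mind is essentially a pre-scan
along these lines (a rough sketch; it only looks for float NANs):

    import math

    def contains_nan(data):
        # One extra pass over the data -- this is the cost
        # measured above.
        return any(isinstance(x, float) and math.isnan(x)
                   for x in data)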

But I hear you about the "return silently incorrect results" part.

Fortunately, I think that only applies to sort-based functions like 
median(). mean() etc. ought to propagate NANs with any reasonable 
implementation, but I'm reluctant to make that a guarantee in case I 
come up with some unreasonable implementation :-)
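
To demonstrate the median() problem: a NAN compares false with
everything, so sorted() leaves it wherever the sort happens to place
it, and the answer depends on where the NAN started out rather than
on the data. (These results are from CPython; strictly, the sort
order of data containing NANs is undefined.)

    >>> import statistics
    >>> nan = float("nan")
    >>> statistics.median([1, 2, nan, 4, 5])
    nan
    >>> statistics.median([nan, 1, 2, 4, 5])
    2

The second answer is silently wrong.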


> This seems at odds with the intended
> focus of the statistics module on correctness over speed. Returning
> incorrect statistics should not be considered a feature that needs to be
> maintained.

It is only incorrect because the data violates the documented 
requirement that it be *numeric data*, and the undocumented requirement 
that the numbers have a total order. (So complex numbers are out.) I 
admit that the docs could be improved, but there are no guarantees made 
about NANs.
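
(This is the total-order failure in miniature: every comparison with
a NAN answers false,

    >>> nan = float("nan")
    >>> (nan < 1, nan > 1, nan == 1)
    (False, False, False)

which is precisely what confuses the sort-based functions.)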

This doesn't mean I don't want to improve the situation! Far from it, 
hence this discussion.


> - FAIL would make sense if statistics functions could introduce *new* NaN
> values. But as far as I can tell, statistics functions already raise
> StatisticsError in these cases (e.g., if zero data points are provided). If
> users are concerned about accidentally propagating NaNs, they should be
> encouraged to check for NaNs at the entry points of their code.

As far as I can tell, there are two kinds of people when it comes to 
NANs: those who think that signalling NANs are a waste of time and NANs 
should always propagate, and those who hate NANs and wish that they 
would always signal (raise an exception).

I'm not going to get into an argument about who is right or who is 
wrong.


> - WARN is even less useful than FAIL. Seriously, who likes warnings?

Me :-)


> NumPy
> uses this approach in array operations that produce NaNs (e.g., when
> dividing by zero), because *some* but not all results may be valid. But
> statistics functions return scalars.
> 
> I'm not even entirely sure it makes sense to add the IGNORE option, or at
> least to add it only for NaN. None is also a reasonable sentinel for a
> missing value in Python, and user defined types (e.g., pandas.NaT) also
> fall in this category. It seems a little strange to single NaN out in
> particular.

I am considering adding support for a dedicated "missing" value, whether 
it is None or a special sentinel. But one thing at a time. Ignoring NANs 
is moderately common in other statistics libraries, and although I 
personally feel that NANs shouldn't be used for missing values, I know 
many people do so.
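
If it helps to make that concrete: ignoring missing values need be no
more than a filtering step in front of the existing functions. A
sketch (the handling of None here is speculative, not something I am
proposing yet):

    import math
    import statistics

    def is_missing(x):
        # Treat float NANs as missing; None is shown for
        # comparison only.
        return x is None or (isinstance(x, float) and math.isnan(x))

    def mean_ignoring_missing(data):
        return statistics.mean(x for x in data if not is_missing(x))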


-- 
Steve

