Treatment of NANs in the statistics module

Fri Mar 16 22:08:42 EDT 2018

On 3/16/2018 7:16 PM, Steven D'Aprano wrote:
> The bug tracker currently has a discussion of a bug in the median(),
> median_low() and median_high() functions that they wrongly compute the
> medians in the face of NANs in the data:
> 
> https://bugs.python.org/issue33084
> 
> I would like to ask people how they would prefer to handle this issue:
> 
> (1) Put the responsibility on the caller to strip NANs from their data.

Options 1 to 3 all put responsibility on the caller to strip NANs to get a
sane answer.  The question is what to do if the caller does not.

(1)
> If there is a NAN in your data, the result of calling median() is
> implementation-defined. This is the current behaviour, and is likely to
> be the fastest.

I hate implementation-defined behavior.
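
A minimal sketch of why the result is implementation-defined, assuming
median() sorts its input before picking the middle element (the current
CPython behaviour as I understand it). No particular outputs are claimed
for the medians, since they depend on the sort implementation:

```python
import statistics

nan = float("nan")

# Every ordering comparison involving a NAN is False, so a sort has no
# consistent place to put it.
print(nan < 1.0, nan > 1.0, nan == nan)   # False False False

# Same values, different input order.  Where the NAN lands after sorting,
# and hence which element is reported as the median, may differ per case.
for data in ([nan, 1.0, 2.0], [1.0, nan, 2.0], [1.0, 2.0, nan]):
    print(data, "->", statistics.median(data))
```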

> (2) Return a NAN.

I don't like NANs as implemented and used, or unused.

> (3) Raise an exception.

That leaves this.

> (4) median() should strip out NANs.

and then proceed in a deterministic fashion to give an answer.

> (5) All of the above, selected by the caller. (In which case, which would
> you prefer as the default?)

I would frame this as an alternative: ignore_nan=False (3) or =True
(4).  Or nan='ignore' versus 'raise' (or 'strict').  These are like the
error-handling choices for encoding.
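
A rough sketch of the keyword form, to make the idea concrete. The name
median_nan and its nan= parameter are hypothetical; nothing like this
exists in the statistics module today, and only float NANs are checked
here:

```python
import math
import statistics


def _is_nan(x):
    # Assumption: only float NANs are detected; Decimal NANs would need
    # separate handling.
    return isinstance(x, float) and math.isnan(x)


def median_nan(data, nan="raise"):
    """Hypothetical wrapper: nan='raise' is option (3), nan='ignore' is (4)."""
    if nan not in ("raise", "ignore"):
        raise ValueError("nan must be 'raise' or 'ignore'")
    values = list(data)
    if nan == "raise":
        if any(_is_nan(x) for x in values):
            raise ValueError("NAN in data")
    else:
        # Strip NANs, then proceed deterministically.  If nothing is left,
        # statistics.median() raises StatisticsError for the empty data.
        values = [x for x in values if not _is_nan(x)]
    return statistics.median(values)


data = [1.0, float("nan"), 3.0]
print(median_nan(data, nan="ignore"))   # 2.0

# The encoding analogy: the caller picks the error handler by name.
print("caf\u00e9".encode("ascii", errors="ignore"))   # b'caf'
```

With the default nan='raise', the same data raises ValueError, much as
errors='strict' raises UnicodeEncodeError when encoding.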

What do statistics.mean() and other functions do? The proposed 
quantile() will have the same issue.
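
A quick check one could run for mean(). My expectation, not verified
against every Python version, is that the NAN simply propagates through
the arithmetic, which amounts to option (2):

```python
import math
import statistics

data = [1.0, float("nan"), 3.0]
result = statistics.mean(data)

# Use math.isnan() rather than ==, since a NAN never compares equal to itself.
print(result, math.isnan(result))
```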

BMDP and other packages had and have general options for dealing with 
missing values, and that is what NAN is.

-- 
Terry Jan Reedy