Treatment of NANs in the statistics module

Steven D'Aprano steve+comp.lang.python at pearwood.info
Fri Mar 16 23:13:34 EDT 2018


On Fri, 16 Mar 2018 22:08:42 -0400, Terry Reedy wrote:

> On 3/16/2018 7:16 PM, Steven D'Aprano wrote:
>> The bug tracker currently has a discussion of a bug in the median(),
>> median_low() and median_high() functions: they compute the wrong
>> median when there are NANs in the data:
[...]

>> (4) median() should strip out NANs.
> 
> and then proceed in a deterministic fashion to give an answer.

Indeed.


>> (5) All of the above, selected by the caller. (In which case, which
>> would you prefer as the default?)
> 
> I would frame this as an alternative: ignore_nan=False (3) or
> ignore_nan=True (4).  Or nan='ignore' versus nan='raise' (or 'strict').
> These are like the choices for encoding errors.

That's what I'm thinking. But which would you have as default? I'm 
guessing "raise".
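
To make the two policies concrete, here is a rough sketch of what a 
caller-selected option might look like. The function name and the `nan` 
parameter are hypothetical (nothing like this exists in the statistics 
module today), and it only checks for float NANs:

```python
import math
import statistics


def _is_float_nan(x):
    # Only float NANs are detected here; Decimal NANs would need extra care.
    return isinstance(x, float) and math.isnan(x)


def median_with_nan_policy(data, nan="raise"):
    """Hypothetical wrapper around statistics.median(), for illustration only."""
    values = list(data)
    if nan == "raise":
        # Option (3): refuse to compute anything if a NAN is present.
        if any(_is_float_nan(x) for x in values):
            raise ValueError("NAN in data")
    elif nan == "ignore":
        # Option (4): strip the NANs, then proceed deterministically.
        values = [x for x in values if not _is_float_nan(x)]
    else:
        raise ValueError(f"unknown nan policy: {nan!r}")
    return statistics.median(values)


data = [1, 2, 3, 4, float("nan")]
print(median_with_nan_policy(data, nan="ignore"))
```

With nan="ignore" the median is taken over 1, 2, 3, 4 only; with the 
default the same data raises ValueError.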


> What do statistics.mean() and other functions do? 

Because they do actual arithmetic on the data points, a float NAN in the 
data will propagate through to the end of the calculation.
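
A quick way to see the propagation, as I understand the current 
behaviour (the data values are made up for the example):

```python
import math
import statistics

data = [1.0, 2.0, float("nan"), 4.0]

# Any arithmetic involving a float NAN yields a NAN, so a plain sum does too.
plain_mean = sum(data) / len(data)
print(math.isnan(plain_mean))

# statistics.mean() does its own summation, but the NAN still comes out the end.
stats_mean = statistics.mean(data)
print(math.isnan(stats_mean))
```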

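A Decimal NAN will behave differently depending on its kind and on the 
current Decimal context:

- a signalling NAN will raise when any operation is performed on it, 
provided the context traps invalid operations (the default);

- a quiet NAN will propagate through arithmetic, but an ordering 
comparison on it will raise if the context traps invalid operations.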
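
Here is a small sketch of how the decimal module treats each kind of 
NAN. The helper try_op() is mine, purely for illustration; it runs an 
operation with the InvalidOperation trap switched on or off and reports 
whether it raised:

```python
from decimal import Decimal, InvalidOperation, localcontext


def try_op(op, trap):
    """Run op() with the InvalidOperation trap on or off; report the outcome."""
    with localcontext() as ctx:
        ctx.traps[InvalidOperation] = trap
        try:
            return op()
        except InvalidOperation:
            return "raised"


snan = Decimal("sNaN")  # signalling NAN
qnan = Decimal("NaN")   # quiet NAN

# Signalling NAN: arithmetic raises when the trap is set.
print(try_op(lambda: snan + 1, trap=True))
# With the trap off, no exception is raised and a NAN comes back instead.
print(try_op(lambda: snan + 1, trap=False))

# Quiet NAN: arithmetic propagates the NAN.
print(try_op(lambda: qnan + 1, trap=True))
# Ordering comparisons (which sorting relies on) raise when the trap is set.
print(try_op(lambda: qnan < 1, trap=True))
print(try_op(lambda: qnan < 1, trap=False))
```

The comparison case is the one that matters for median(), since sorting 
the data compares the values rather than doing arithmetic on them.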

It seems reasonable for median() to handle NANs better than it currently 
does, in which case I'd expect the rest of the statistics module to do 
the same.


> The proposed quantile() will have the same issue.
> 
> BMDP and other packages had and have general options for dealing with
> missing values, and that is what NAN is.

I don't wish to get into an argument about whether NANs are missing 
values or could be missing values, but R supports both NANs and a 
dedicated NA ("not available") missing value. By default, either will 
cause median to return NA, but there is an option to ignore NANs:

    > median(c(1, 2, 3, 4, NaN))
    [1] NA
    > median(c(1, 2, 3, 4, NaN), na.rm=TRUE)
    [1] 2.5



-- 
Steve



