Treatment of NANs in the statistics module

Léo El Amri leo at superlel.me
Sat Mar 17 05:52:11 EDT 2018


On 17/03/2018 00:16, Steven D'Aprano wrote:
> The bug tracker currently has a discussion of a bug in the median(), 
> median_low() and median_high() functions that they wrongly compute the 
> medians in the face of NANs in the data:
> 
> https://bugs.python.org/issue33084
> 
> I would like to ask people how they would prefer to handle this issue:

TL;DR: I choose (5)

I agree with Terry Reedy's proposal for (5); however, I want to define
precisely what we mean by "ignore".
In my opinion, "ignoring" should be more like "stripping". When the
number of data points is odd, we can return a NAN without any concern.
But when the number of data points is even and at least one of the two
middle values is a NAN, we will probably get an exception. To avoid
over-complicating things, I think "ignore" should mean removing the
NANs before the data points are actually processed. The keyword
argument "nan" would then have two possible values: 'strip' (which does
what I just described) and 'raise' (which raises an exception if there
is a NAN in the data points).
We should still consider adding an "ignore" option later. This option
would blindly ignore NAN values: if an exception occurs during the
actual processing (say we have an even number of data points and a NAN
as one of the two middle values), it propagates up to the caller.
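
A minimal sketch of the three policies described above, written as a
wrapper around statistics.median(). The function name
median_with_nan_policy and the choice of ValueError are assumptions made
for illustration; only the "nan" keyword and its values come from the
proposal. It also assumes float-like data, since it relies on
math.isnan().

```python
import math
import statistics


def median_with_nan_policy(data, nan="strip"):
    """Hypothetical wrapper illustrating the proposed "nan" keyword.

    This is not an existing statistics API, only a sketch of the idea.
    """
    values = list(data)
    if nan == "strip":
        # Remove NANs before any actual processing takes place.
        values = [x for x in values if not math.isnan(x)]
    elif nan == "raise":
        # Refuse to compute anything if a NAN is present.
        if any(math.isnan(x) for x in values):
            raise ValueError("NAN found in data")
    elif nan != "ignore":
        raise ValueError("nan must be 'strip', 'raise' or 'ignore'")
    # With 'ignore' the data is passed through untouched; whatever
    # statistics.median() does with the NANs is what the caller gets.
    return statistics.median(values)
```

With 'strip', median_with_nan_policy([1.0, float("nan"), 3.0, 2.0])
computes the median of the three remaining values. Note that stripping
can leave an empty data set, in which case statistics.median() raises
StatisticsError.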

From my point of view, I prefer (5), with a default of 'strip'. Your
argument that (1) is the fastest (in terms of running time, I believe;
tell me if I'm wrong) is also met by the 'ignore' option.

Going with (1) would force Python developers to write
implementation-specific code (or rather "implementation-defined-prone"
code). By contrast, (5) keeps the Python-side code simple.
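
To illustrate the Python-side burden mentioned above: if NAN handling is
left entirely to the caller (an assumed reading of what (1) implies),
every call site needs its own filtering, roughly as follows. The helper
name median_without_nans is hypothetical.

```python
import math
import statistics


def median_without_nans(data):
    # Caller-side workaround: drop NANs by hand before calling median(),
    # because the result with NANs present cannot be relied upon.
    return statistics.median(x for x in data if not math.isnan(x))
```

A keyword argument as in (5) would move this filtering into the module
itself instead of repeating it at each call site.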

Options (2) to (4) force a single behavior on Python developers. That's
not necessarily a bad thing, but since (5) allows flexibility at no
cost, I don't see why we shouldn't go with it.


--
Léo El Amri


