NaN comparisons - Call For Anecdotes

Johann Hibschman jhibschman at gmail.com
Thu Jul 17 14:49:15 EDT 2014


Chris Angelico <rosuav at gmail.com> writes:

> But you also don't know that he hasn't. NaN doesn't mean "unknown", it
> means "Not a Number". You need a more sophisticated system that allows
> for uncertainty in your data.

Regardless of whether this is the right design, it's still an example of
NaN comparisons being used in practice.

As to the design, using NaN to implement NA is a hack with a long
history; see

      http://www.numpy.org/NA-overview.html

for some color.  Using NaN gets us a hardware-accelerated implementation
with just about the right semantics.  In a real example, these lists are
numpy arrays with tens of millions of elements, so this isn't a trivial
benefit.  (Technically, that's what's in the database; a given analysis
may look at a sample of 100k or so.)
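For concreteness, a minimal sketch of that pattern (the array contents
and sizes here are invented, not taken from the real data):

    import numpy as np

    # Invented example: a series with missing observations encoded as NaN.
    prices = np.array([101.5, np.nan, 99.2, 103.7, np.nan, 100.1])

    # Boolean mask of the observed (non-missing) entries.
    observed = ~np.isnan(prices)

    # NaN-aware reductions skip the missing values in compiled code.
    print(np.nanmean(prices))           # mean over observed entries only
    print(int(np.isnan(prices).sum()))  # number of missing entries

Everything stays in vectorized numpy operations, which is where the
hardware-accelerated benefit mentioned above comes from.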

> You have a special business case here (the need to
> record information with a "maybe" state), and you need to cope with
> it, which means dedicated logic and planning and design and code.

Yes, in principle.  In practice, everyone is used to the semantics of
R-style missing data, which NaN matches reasonably well.  In principle,
(NA == 1.0) should be NA (a missing truth value), as should (NA == NA),
but in practice having both comparisons return False is more useful.  As
an example, indexing an R vector by a boolean vector containing NA
yields NA results, which is a feature I never want.
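To make that concrete, a small invented example of how the NaN-based
encoding behaves in numpy, as opposed to R's NA propagation described
above:

    import numpy as np

    x = np.array([1.0, np.nan, 2.0])

    # NaN compares unequal to everything, including itself.
    print(np.nan == 1.0)     # False
    print(np.nan == np.nan)  # False

    # So comparing against a NaN-containing array yields an ordinary
    # boolean mask, with False in the missing positions...
    mask = (x == 1.0)        # array([ True, False, False])

    # ...and boolean indexing simply drops the missing entries,
    # instead of propagating NA the way R does.
    print(x[mask])           # array([1.])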

Cheers,
Johann


