NaN comparisons - Call For Anecdotes

Fri Jul 18 13:57:03 EDT 2014

On Fri, 18 Jul 2014 01:36:24 +1000, Chris Angelico wrote:

> On Fri, Jul 18, 2014 at 1:12 AM, Johann Hibschman <jhibschman at gmail.com>
> wrote:
>> Well, I just spotted this thread.  An easy example is, well, pretty
>> much any case where SQL NULL would be useful.  Say I have lists of
>> borrowers, the amount owed, and the amount they paid so far.
>>
>>     nan = float("nan")
>>     borrowers = ["Alice", "Bob", "Clem", "Dan"] amount_owed = [100.0,
>>     nan, 200.0, 300.0] amount_paid = [100.0, nan, nan, 200.0]
>>     who_paid_off = [b for (b, ao, ap) in
>>                           zip(borrowers, amount_owed, amount_paid)
>>                       if ao == ap]
>>
>> I want to just get Alice from that list, not Bob.  I don't know how
>> much Bow owes or how much he's paid, so I certainly don't know that
>> he's paid off his loan.
>>
>>
> But you also don't know that he hasn't. NaN doesn't mean "unknown", it
> means "Not a Number". You need a more sophisticated system that allows
> for uncertainty in your data. I would advise using either None or a
> dedicated singleton (something like `unknown = object()` would work, or
> you could make a custom type with a more useful repr)

Hmmm, there's something to what you say there, but IEEE-754 NANs seem to 
have been designed to do quadruple (at least!) duty with multiple 
meanings, including:

- Missing values ("I took a reading, but I can't read my handwriting").

- Data known only qualitatively, not quantitatively (e.g. windspeed =
  "fearsome").

- Inapplicable values, e.g. the average depth of the oceans on Mars.

- The result of calculations which are mathematically indeterminate,
  such as 0/0.

- The result of real-valued calculations which are invalid due to
  domain errors, such as sqrt(-1) or acos(2.5).

- The result of calculations which are conceptually valid, but are
  unknown due to limitations of floats, e.g. you have two finite
  quantities which have both overflowed to INF, the difference
  between them ought to be finite, but there's no way to tell what
  it should be.

It seems to me that the way you treat a NAN will often depend on which 
category it falls under. E.g. when taking the average of a set of values, 
missing values ought to be skipped over, while actual indeterminate NANs 
ought to carry through:

    average([1, 1, 1, Missing, 1]) => 1
    average([1, 1, 1, 0/0, 1]) => NAN

I know that R distinguishes between NA and IEEE-754 NANs, although I'm 
not sure how complete its support for NANs is. But many (most?) R 
functions take an argument controlling whether or not to ignore NA values.

In principle, you can encode the different meanings into NANs using the 
payload. There are 9007199254740988 possible Python float NANs. Half of 
these are signalling NANs, half are quiet NANs. Ignoring the sign bit 
leaves us with 2251799813685247 distinct sNANs and the same qNANs. That's 
enough to encode a *lot* of different meanings.

[Aside: I find myself perplexed why IEEE-754 says that the sign bit of 
NANs should be ignored, but then specifies that another bit is to be used 
to distinguish signalling from quiet NANs. Why not just interpret NANs 
with the sign bit set are signalling, those with it clear are quiet?]

-- 
Steven