None versus MISSING sentinel -- request for design feedback

Steven D'Aprano steve+comp.lang.python at pearwood.info
Fri Jul 15 01:28:41 EDT 2011


Hello folks,

I'm designing an API for some lightweight calculator-like statistics
functions, such as mean, standard deviation, etc., and I want to support
missing values. Missing values should be just ignored. E.g.:

mean([1, 2, MISSING, 3]) => 6/3 = 2 rather than 6/4 or raising an error.

My question is, should I accept None as the missing value, or a dedicated
singleton?

In favour of None: it's already there, no extra code required. People may
expect it to work.

Against None: it's too easy to mistakenly add None to a data set by mistake,
because functions return None by default.

In favour of a dedicated MISSING singleton: it's obvious from context. It's
not a lot of work to implement compared to using None. Hard to accidentally
include it by mistake. If None does creep into the data by accident, you
get a nice explicit exception.

Against MISSING: users may expect to be able to choose their own sentinel by
assigning to MISSING. I don't want to support that.


I've considered what other packages do:-

R uses a special value, NA, to stand in for missing values. This is more or
less the model I wish to follow.

I believe that MATLAB treats float NANs as missing values. I consider this
an abuse of NANs and I won't be supporting that :-P

Spreadsheets such as Excel, OpenOffice and Gnumeric generally ignore blank
cells, and give you a choice between ignoring text and treating it as zero.
E.g. with cells set to [1, 2, "spam", 3] the AVERAGE function returns 2 and
the AVERAGEA function returns 1.5.

numpy uses masked arrays, which is probably over-kill for my purposes; I am
gratified to see it doesn't abuse NANs:

>>> import numpy as np
>>> a = np.array([1, 2, float('nan'), 3])
>>> np.mean(a)
nan

numpy also treats None as an error:

>>> a = np.array([1, 2, None, 3])
>>> np.mean(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.5/site-packages/numpy/core/fromnumeric.py", line
860, in mean
    return mean(axis, dtype, out)
TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'


I would appreciate any comments, advice or suggestions. 


-- 
Steven




More information about the Python-list mailing list