None versus MISSING sentinel -- request for design feedback

Fri Jul 15 13:02:08 EDT 2011

On Thu, Jul 14, 2011 at 11:28 PM, Steven D'Aprano
<steve+comp.lang.python at pearwood.info> wrote:
> Hello folks,
>
> I'm designing an API for some lightweight calculator-like statistics
> functions, such as mean, standard deviation, etc., and I want to support
> missing values. Missing values should be just ignored. E.g.:
>
> mean([1, 2, MISSING, 3]) => 6/3 = 2 rather than 6/4 or raising an error.
>
> My question is, should I accept None as the missing value, or a dedicated
> singleton?
>
> In favour of None: it's already there, no extra code required. People may
> expect it to work.
>
> Against None: it's too easy to add None to a data set by mistake,
> because functions return None by default.

Good point.

>
> In favour of a dedicated MISSING singleton: it's obvious from context. It's
> not a lot of work to implement compared to using None. It's hard to include
> it by accident. If None does creep into the data by accident, you get a
> nice explicit exception.

Also good points.

>
> Against MISSING: users may expect to be able to choose their own sentinel by
> assigning to MISSING. I don't want to support that.
>
>
> I've considered what other packages do:-
>
> R uses a special value, NA, to stand in for missing values. This is more or
> less the model I wish to follow.
>
> I believe that MATLAB treats float NANs as missing values. I consider this
> an abuse of NANs and I won't be supporting that :-P

I was just thinking of this.  :)
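
To make the NaN concern concrete, here is a small plain-Python sketch (no MATLAB or numpy involved; the variable names are just for illustration). A NaN-skipping mean cannot tell a deliberately missing value from a NaN that came out of a broken calculation:

```python
import math

data = [1, 2, float("nan"), 3]

# NaN propagates through arithmetic, so a naive mean is itself NaN.
naive = sum(data) / len(data)

# NaN never compares equal to anything, itself included, so filtering
# has to use math.isnan() rather than ==.
kept = [x for x in data if not math.isnan(x)]
nan_skipping = sum(kept) / len(kept)

print(naive, nan_skipping)
```

The second result looks tidy, but it would look exactly the same if that NaN had been the product of a bug upstream.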

>
> Spreadsheets such as Excel, OpenOffice and Gnumeric generally ignore blank
> cells, and give you a choice between ignoring text and treating it as zero.
> E.g. with cells set to [1, 2, "spam", 3] the AVERAGE function returns 2 and
> the AVERAGEA function returns 1.5.
>
> numpy uses masked arrays, which is probably over-kill for my purposes; I am
> gratified to see it doesn't abuse NANs:
>
> >>> import numpy as np
> >>> a = np.array([1, 2, float('nan'), 3])
> >>> np.mean(a)
> nan
>
> numpy also treats None as an error:
>
> >>> a = np.array([1, 2, None, 3])
> >>> np.mean(a)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/usr/lib/python2.5/site-packages/numpy/core/fromnumeric.py", line 860, in mean
>     return mean(axis, dtype, out)
> TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'
>
>
> I would appreciate any comments, advice or suggestions.
>
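
As an aside, the two spreadsheet behaviours you describe are easy to mimic in Python. This is only a sketch of the semantics as stated in your message (blank cells ignored, text either ignored or counted as zero). The function names average and averagea are made up here, and using None to stand for a blank cell is my assumption:

```python
def average(cells):
    """Mean of the numeric cells only; blanks (None) and text are ignored."""
    nums = [c for c in cells
            if isinstance(c, (int, float)) and not isinstance(c, bool)]
    return sum(nums) / len(nums)


def averagea(cells):
    """Mean where text counts as zero; blanks (None) are still ignored."""
    values = []
    for c in cells:
        if c is None:
            continue  # blank cell: skipped entirely
        values.append(0 if isinstance(c, str) else c)
    return sum(values) / len(values)


cells = [1, 2, "spam", 3]
print(average(cells), averagea(cells))
```

That gives 6/3 for the first function and 6/4 for the second, matching the 2 and 1.5 in your example.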

Too bad there isn't a good way to "freeze" a name, i.e. to make any
attempt to rebind it raise an exception.  Trying to rebind None is a
SyntaxError, but for your own name a NameError or something similar
would be fine.  Then the downside of using your own sentinel here goes
away.
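
The closest approximation I can think of is a module object that refuses attribute assignment for selected names. This is only a sketch: the class name _FrozenNamesModule and the module name "stats" are hypothetical, and it raises AttributeError rather than NameError. It also only guards assignments of the form stats.MISSING = ...; anyone who does "from stats import MISSING" can still rebind their own local name.

```python
import types


class _FrozenNamesModule(types.ModuleType):
    """Module type that refuses to rebind selected names once they are set."""

    _frozen = frozenset(["MISSING"])

    def __setattr__(self, name, value):
        # Allow the first binding, reject any later rebinding.
        if name in self._frozen and name in self.__dict__:
            raise AttributeError(
                "cannot rebind %s.%s" % (self.__name__, name))
        super().__setattr__(name, value)

    def __delattr__(self, name):
        if name in self._frozen:
            raise AttributeError(
                "cannot delete %s.%s" % (self.__name__, name))
        super().__delattr__(name)


# Stand-in for a real statistics module.
stats = _FrozenNamesModule("stats")
stats.MISSING = object()
original = stats.MISSING

try:
    stats.MISSING = None  # a user trying to pick their own sentinel
    rebind_failed = False
except AttributeError:
    rebind_failed = True

print(rebind_failed)
```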

In reality, using Missing may be your best bet anyway.  If there were
a convention for indicating a name should not be re-bound (like a
single leading underscore indicates "private"), you could use that
(all caps?).  Since "we're all consenting adults" it would probably be
good enough to make sure others know that Missing should not be
re-bound...

I might have said to use NotImplemented instead of None, but it can be
re-bound and the name isn't as helpful for your use case.

Another solution, perhaps ugly or confusing, is to use something like
two underscores as the name for your sentinel:

mean([1, 2, __, 3])

Still it seems like using Missing (or whatever) would be better than None.

-eric

>
> --
> Steven