[SciPy-Dev] Proposal for a new function nanpdist that treats NaNs as missing values

Moritz Beber moritz.beber at gmail.com
Thu Aug 14 05:33:52 EDT 2014


Answers to Jaime's post:

> Warren has already pointed this out, but let me insist: what is nanpdist,
> or the nan keyword expected to do? Treat pairs of vectors with NaNs as
> lower dimensional, removing pairs of entries where either is NaN? Do those
> results make any real sense? Thinking of euclidean distance for points in
> 3D space, I have trouble thinking of a practical situation where "if any Z
> coordinate is missing, just give me the distance of the projections onto
> the XY plane" would be anything but a misleading result. I presume the case
> is different for all those other distances I have never needed to use, so I
> am just curious of the use case.
>

Please see my answer to Warren about the use-case. In three dimensions this
would certainly not make sense but my use-case has over three thousand
dimensions. What I have in mind is a scaling factor for distance metrics,
as suggested before, and an appropriate consideration of dissimilarity of
the missing coordinate in similarity measures.
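
A minimal sketch of the scaling idea for the Euclidean case. The name
nan_euclidean and the particular factor (total dimensions divided by the
number of coordinate pairs present in both vectors) are assumptions made for
illustration; the proposal does not fix either of them.

```python
import numpy as np


def nan_euclidean(u, v):
    """Euclidean distance that skips NaN coordinates and rescales the result.

    Hypothetical helper: the scaling factor n / n_valid is an assumption.
    """
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    # A coordinate pair counts only if both entries are present.
    valid = ~(np.isnan(u) | np.isnan(v))
    n_valid = valid.sum()
    if n_valid == 0:
        # Nothing to compare, so the distance is undefined.
        return np.nan
    diff = u[valid] - v[valid]
    # Scale the squared sum up as if every dimension had contributed
    # the average of the observed squared differences.
    return np.sqrt(u.size / n_valid * np.dot(diff, diff))
```

With thousands of dimensions and only a few missing entries the factor stays
close to 1, whereas in three dimensions losing one coordinate changes the
result substantially, which matches the concern raised above.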


>
> Looking at your linked post, from an implementation point of view, at the
> low level function that is actually going to do the heavy lifting, it is
> probably better to, rather than hardcode a check for NaN-ness, take a
> 'where' kwarg, as numpy ufuncs already do (
> http://docs.scipy.org/doc/numpy/reference/ufuncs.html#optional-keyword-arguments),
> and build the masking array in a higher level wrapper. This would make it
> easier to eventually make this functionality work with masked arrays or the
> like.
>

I'd be perfectly happy to do so. The hard-coded check is inspired by
bottleneck which does exactly that for all its nan* functions. But I agree
that a mask is preferable.
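
A rough sketch of that split, assuming a squared Euclidean kernel. All
function names here are hypothetical, and the 'where' argument only mimics
the ufunc keyword rather than using the ufunc machinery itself: the low
level function knows nothing about NaN, and thin wrappers build the mask.

```python
import numpy as np


def masked_sqeuclidean(u, v, where):
    """Low-level kernel: only coordinates with where == True contribute."""
    # Excluded coordinates are replaced by 0.0, so NaNs there never
    # reach the sum.
    diff = np.where(where, u - v, 0.0)
    return np.dot(diff, diff)


def nan_sqeuclidean(u, v):
    """Wrapper that treats NaN as a missing value."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    where = ~(np.isnan(u) | np.isnan(v))
    return masked_sqeuclidean(u, v, where)


def ma_sqeuclidean(u, v):
    """Wrapper for numpy masked arrays, reusing the same kernel."""
    where = ~(np.ma.getmaskarray(u) | np.ma.getmaskarray(v))
    return masked_sqeuclidean(np.ma.getdata(u), np.ma.getdata(v), where)
```

The NaN check then lives in one place, and supporting another notion of
"missing" only requires another wrapper.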


>
> As a separate but related issue, I have had this PR open for almost a year
> now, https://github.com/scipy/scipy/pull/3163, and although me saying I
> want to complete it is getting old, hopefully whatever you have in mind can
> fit with the general structure of that.
>

I haven't fully grasped your code in umath_distance.c.src but that's
probably a separate discussion. I also couldn't tell if some of that code
is automatically generated or all written by hand.


>
> Lastly, whatever you go for, I don't think you should do anything to pdist
> that you don't also do for cdist and the individual distance functions.
>
>
Noted and agreed.

Best,
Moritz