[SciPy-Dev] boolean / real-value distance metrics

Fri Jan 6 04:35:02 EST 2012

On Fri, Jan 6, 2012 at 12:37 AM, Jacob VanderPlas <vanderplas at astro.washington.edu> wrote:

> Hi all,
> I've been taking a closer look at the various metrics in
> scipy.spatial.distance.  In particular, every metric designed for
> boolean values behaves differently depending on whether the function is
> used directly, or cdist/pdist is used (see the example below).
> cdist/pdist first converts the float array to bool, then performs the
> computation.  The calls to the metric functions work directly with the
> floating point vectors and yield a different result.
>
> I've poked around, and haven't found any documentation anywhere that
> addresses this:
> Is this a feature of scipy, or a bug?  Which behavior is correct in this
> case?
> Are these boolean metrics, when generalized to floating point, true
> metrics?  That is, can it be shown that they satisfy the triangle
> inequality?
>
> I'd like to work on the documentation to make all of this more clear,
> but I don't know where to start...  Thanks
>   Jake
>
> Example code:
>
> In [1]: from scipy.spatial.distance import cdist, yule
>
> In [2]: import numpy as np
>
> In [3]: np.random.seed(0)
>
> In [4]: x = np.random.random(100)
>
> In [5]: x[x>0.5] = 0 # set ~half the entries to zero
>
> In [6]: y = np.random.random(100)
>
> In [7]: y[y>0.5] = 0  # set half of entries to zero
>
> In [8]: yule(x, y)  # direct computation: this does not convert to bool
> Out[8]: 0.96988390020367443
>
> In [9]: cdist([x], [y], 'yule')[0, 0]  # cdist computation: this does convert to bool
> Out[9]: 0.83211678832116787
>
>

The boolean dissimilarity functions (such as yule) expect either boolean
arrays or numeric arrays of 0 and 1.  They are not meant to be generalized
to arrays of arbitrary floating point values.  This is not documented (as
far as I can tell), but it can be inferred from, for example, the
_nbool_correspond_ft_tf function, which is used by some of the
dissimilarity functions:

def _nbool_correspond_ft_tf(u, v):
    if u.dtype == np.int or u.dtype == np.float_ or u.dtype == np.double:
        not_u = 1.0 - u
        not_v = 1.0 - v
        nft = (not_u * v).sum()
        ntf = (u * not_v).sum()
    else:
        not_u = ~u
        not_v = ~v
        nft = (not_u & v).sum()
        ntf = (u & not_v).sum()

    return (nft, ntf)

Note that for a floating point array, not_u is computed as 1.0 - u, which
matches logical negation only when every entry is exactly 0 or 1.
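
To make the difference concrete, here is a small sketch that mimics the two branches of the function quoted above. The names ft_tf_float_branch and ft_tf_bool_branch are made up for this illustration and are not part of scipy; the bool branch also casts its inputs to bool first, which is my assumption about what cdist effectively does.

```python
import numpy as np

def ft_tf_float_branch(u, v):
    # Mirrors the numeric branch: "not" is computed as 1.0 - x.
    not_u = 1.0 - u
    not_v = 1.0 - v
    return (not_u * v).sum(), (u * not_v).sum()

def ft_tf_bool_branch(u, v):
    # Mirrors the boolean branch after casting: any nonzero entry is True.
    u = u.astype(bool)
    v = v.astype(bool)
    return int((~u & v).sum()), int((u & ~v).sum())

# Arrays of 0 and 1: both branches give the same counts.
u01 = np.array([0.0, 1.0, 1.0, 0.0])
v01 = np.array([1.0, 0.0, 1.0, 0.0])
print(ft_tf_float_branch(u01, v01), ft_tf_bool_branch(u01, v01))

# Arbitrary floats: the "counts" from the numeric branch become fractional.
u = np.array([0.0, 0.25, 0.5, 0.0])
v = np.array([0.5, 0.0, 0.5, 0.0])
print(ft_tf_float_branch(u, v), ft_tf_bool_branch(u, v))
```

For the second pair, the numeric branch gives 0.75 and 0.5, while the boolean branch gives 1 and 1. That is the same kind of mismatch as in the yule example from the original message.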

Any improvement of the documentation would certainly be welcome!

Likewise for the code: that test for the dtype of u misses many of the
numeric data types, and the check for np.float_ and np.double is
redundant, since these are both just different names for np.float64.
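
One possible way to write a broader check (a sketch only, not what scipy does) is to ask numpy whether the dtype is a subtype of np.number. The helper name is_numeric_dtype is hypothetical.

```python
import numpy as np

def is_numeric_dtype(arr):
    # True for integer, unsigned, floating and complex dtypes; False for bool.
    return np.issubdtype(arr.dtype, np.number)

for name in ["int8", "int32", "uint16", "float32", "float64", "bool"]:
    arr = np.zeros(3, dtype=name)
    print(name, is_numeric_dtype(arr))

# np.double and np.float64 describe the same dtype.
same_type = np.dtype(np.double) == np.dtype(np.float64)
print(same_type)
```

This accepts the narrower integer and float types that the equality tests above miss, and it still sends bool arrays down the other branch.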

The separate dissimilarity functions such as yule are implemented
in Python, while cdist is a wrapper for C code.  The C functions require
a specific data type for their arrays, which is (presumably) why cdist
converts to boolean first.  Instead of having a separate calculation for
bool and non-bool arrays, perhaps the dissimilarity functions should
do the same as cdist and simply convert non-bool arrays to boolean.
This would make them consistent with cdist.
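
A rough sketch of what that could look like for yule, assuming that "convert to boolean" means any nonzero entry becomes True. The function name yule_via_bool is hypothetical, and the sketch does not guard against a zero denominator.

```python
import numpy as np

def yule_via_bool(u, v):
    # Coerce to bool up front so float and bool inputs take the same path.
    u = np.asarray(u).astype(bool)
    v = np.asarray(v).astype(bool)
    ntt = int((u & v).sum())
    ntf = int((u & ~v).sum())
    nft = int((~u & v).sum())
    nff = int((~u & ~v).sum())
    # Yule dissimilarity: 2*ntf*nft / (ntt*nff + ntf*nft).
    return 2.0 * ntf * nft / (ntt * nff + ntf * nft)

x = np.array([0.3, 0.0, 0.7, 0.0, 0.2])
y = np.array([0.0, 0.4, 0.1, 0.0, 0.9])

d_float = yule_via_bool(x, y)            # float input
d_bool = yule_via_bool(x != 0, y != 0)   # same data, already boolean
print(d_float, d_bool)
```

With the conversion done first, the float input and its boolean mask give the same value, so the direct call could no longer disagree with a cdist-style computation on the same data.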

Warren