[SciPy-Dev] boolean / real-value distance metrics

Jacob VanderPlas vanderplas at astro.washington.edu
Fri Jan 6 01:37:52 EST 2012


Hi all,
I've been taking a closer look at the various metrics in 
scipy.spatial.distance.  In particular, every metric designed for 
boolean values behaves differently depending on whether the function is 
used directly, or cdist/pdist is used (see the example below).  
cdist/pdist first converts the float array to bool, then performs the 
computation.  The calls to the metric functions work directly with the 
floating point vectors and yield a different result.
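
To make the discrepancy concrete, here is a minimal pure-Python sketch of the two code paths. It assumes that the direct call builds its four correspondence counts from products such as u*v and u*(1-v), and that cdist/pdist simply treats every nonzero entry as True. The function names are hypothetical and are not part of scipy.

```python
def correspondence_counts(u, v):
    """Return (ntt, ntf, nft, nff) built from products.

    Assumption: this mirrors how a float input could be handled, so a
    fractional entry contributes fractionally to each count.
    """
    ntt = sum(a * b for a, b in zip(u, v))
    ntf = sum(a * (1.0 - b) for a, b in zip(u, v))
    nft = sum((1.0 - a) * b for a, b in zip(u, v))
    nff = sum((1.0 - a) * (1.0 - b) for a, b in zip(u, v))
    return ntt, ntf, nft, nff


def yule_from_counts(ntt, ntf, nft, nff):
    """Yule dissimilarity from the four counts (undefined if the denominator is 0)."""
    return 2.0 * ntf * nft / (ntt * nff + ntf * nft)


def yule_float(u, v):
    """Hypothetical 'direct' path: use the float values as they are."""
    return yule_from_counts(*correspondence_counts(u, v))


def yule_bool(u, v):
    """Hypothetical 'cdist' path: map nonzero entries to 1.0 first."""
    ub = [1.0 if a != 0 else 0.0 for a in u]
    vb = [1.0 if b != 0 else 0.0 for b in v]
    return yule_from_counts(*correspondence_counts(ub, vb))


x = [0.2, 0.0, 0.4, 0.0]
y = [0.0, 0.3, 0.1, 0.0]
print(yule_float(x, y))  # float path
print(yule_bool(x, y))   # bool path
```

Under these assumptions the two paths agree whenever the inputs contain only 0 and 1, and differ as soon as any entry is fractional.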

I've poked around, and haven't found any documentation anywhere that 
addresses this:
Is this a feature of scipy, or a bug?  Which behavior is correct in this 
case?
Are these boolean metrics, when generalized to floating point, true 
metrics?  That is, can it be shown that they satisfy the triangle inequality?
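
One way to probe the triangle inequality numerically is a brute-force search for counterexamples. The sketch below is only illustrative: the helper name triangle_violations is hypothetical, and the two toy distances on scalars are there just to show what a passing and a failing case look like.

```python
from itertools import permutations


def triangle_violations(dist, points, tol=1e-12):
    """Return all ordered triples (a, b, c) with dist(a, c) > dist(a, b) + dist(b, c)."""
    bad = []
    for a, b, c in permutations(points, 3):
        if dist(a, c) > dist(a, b) + dist(b, c) + tol:
            bad.append((a, b, c))
    return bad


def abs_diff(p, q):
    """A true metric on the real line."""
    return abs(p - q)


def sq_diff(p, q):
    """Squared difference: not a metric, it breaks the triangle inequality."""
    return (p - q) ** 2


pts = [0.0, 1.0, 2.0]
print(triangle_violations(abs_diff, pts))  # no violations expected
print(triangle_violations(sq_diff, pts))   # (0, 1, 2) and (2, 1, 0) violate
```

The same helper could be given any float-generalized dissimilarity and a set of random vectors. A returned triple is a genuine counterexample, while an empty list shows only that none was found among the points tried.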

I'd like to work on the documentation to make all of this more clear, 
but I don't know where to start...  Thanks
   Jake

Example code:

In [1]: from scipy.spatial.distance import cdist, yule

In [2]: import numpy as np

In [3]: np.random.seed(0)

In [4]: x = np.random.random(100)

In [5]: x[x>0.5] = 0 # set ~half the entries to zero

In [6]: y = np.random.random(100)

In [7]: y[y>0.5] = 0 # set ~half the entries to zero

In [8]: yule(x, y)  # direct computation: this does not convert to bool
Out[8]: 0.96988390020367443

In [9]: cdist([x], [y], 'yule')[0, 0]  # cdist computation: this does convert to bool
Out[9]: 0.83211678832116787



