[SciPy-Dev] boolean / real-value distance metrics
Jacob VanderPlas
vanderplas at astro.washington.edu
Fri Jan 6 01:37:52 EST 2012
Hi all,
I've been taking a closer look at the various metrics in
scipy.spatial.distance. In particular, every metric designed for
boolean values gives a different result on floating point input
depending on whether the function is called directly or through
cdist/pdist (see the example below).
cdist/pdist first converts the float array to bool, then performs the
computation. The calls to the metric functions work directly with the
floating point vectors and yield a different result.
I've poked around, and haven't found any documentation anywhere that
addresses this:
Is this a feature of scipy, or a bug? Which behavior is correct in this
case?
Are these boolean metrics, when generalized to floating point, true
metrics? That is, can it be shown that they satisfy the triangle inequality?
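One way to start on the triangle inequality question is a brute-force
search for counterexamples. The sketch below is only an illustration: the
helper find_triangle_violation and the two toy distances are hypothetical
names, not part of scipy, and a search that finds nothing is evidence, not
a proof.

```python
from itertools import product


def find_triangle_violation(dist, points, tol=1e-12):
    """Return the first triple (a, b, c) with d(a, c) > d(a, b) + d(b, c).

    Returns None if no triple among `points` violates the inequality.
    `tol` absorbs floating point round-off.
    """
    for a, b, c in product(points, repeat=3):
        if dist(a, c) > dist(a, b) + dist(b, c) + tol:
            return a, b, c
    return None


def hamming(u, v):
    # Fraction of positions where the two vectors disagree (a true metric).
    return sum(p != q for p, q in zip(u, v)) / len(u)


def squared_euclidean(u, v):
    # Squared Euclidean distance (not a metric on real-valued vectors).
    return sum((p - q) ** 2 for p, q in zip(u, v))


# All 0/1 vectors of length 3, and three collinear real-valued points.
cube = list(product([0.0, 1.0], repeat=3))
line = [(0.0,), (1.0,), (2.0,)]

print(find_triangle_violation(hamming, cube))
print(find_triangle_violation(squared_euclidean, line))
```

Any function of two vectors can be passed as dist, so the same search could
be run over a grid of values in [0, 1] for a floating point generalization
of one of the boolean metrics.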
I'd like to work on the documentation to make all of this more clear,
but I don't know where to start... Thanks
Jake
Example code:
In [1]: from scipy.spatial.distance import cdist, yule
In [2]: import numpy as np
In [3]: np.random.seed(0)
In [4]: x = np.random.random(100)
In [5]: x[x>0.5] = 0 # set ~half the entries to zero
In [6]: y = np.random.random(100)
In [7]: y[y>0.5] = 0 # set ~half the entries to zero
In [8]: yule(x, y) # direct computation: this does not convert to bool
Out[8]: 0.96988390020367443
In [9]: cdist([x], [y], 'yule')[0, 0] # cdist computation: this does convert to bool
Out[9]: 0.83211678832116787
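The kind of gap seen between Out[8] and Out[9] can be reproduced on a small
hand-checkable case without scipy. This is a sketch under two assumptions:
that the Yule dissimilarity is 2*ntf*nft / (ntt*nff + ntf*nft), and that the
direct call treats each float entry as a weight in [0, 1] when forming the
four counts. The names correspondence_counts, yule_from_counts and to_bool
are hypothetical helpers, not scipy functions.

```python
def correspondence_counts(u, v):
    """Return (nff, nft, ntf, ntt), treating entries as weights in [0, 1].

    For 0/1 input these are the usual counts of False/False, False/True,
    True/False and True/True positions.
    """
    nff = sum((1.0 - a) * (1.0 - b) for a, b in zip(u, v))
    nft = sum((1.0 - a) * b for a, b in zip(u, v))
    ntf = sum(a * (1.0 - b) for a, b in zip(u, v))
    ntt = sum(a * b for a, b in zip(u, v))
    return nff, nft, ntf, ntt


def yule_from_counts(u, v):
    # Assumed formula: 2*ntf*nft / (ntt*nff + ntf*nft).
    # No guard against a zero denominator in this sketch.
    nff, nft, ntf, ntt = correspondence_counts(u, v)
    half_r = ntf * nft
    return 2.0 * half_r / (ntt * nff + half_r)


def to_bool(u):
    # Mimic a float -> bool conversion: every nonzero entry becomes 1.0.
    return [1.0 if a != 0 else 0.0 for a in u]


x = [0.5, 0.0, 0.25, 0.0]
y = [0.0, 0.5, 0.25, 0.0]

print(yule_from_counts(x, y))                    # floats used as weights
print(yule_from_counts(to_bool(x), to_bool(y)))  # converted to bool first
```

On these vectors the weighted version gives 121/81 (about 1.49), while
converting to bool first gives exactly 1.0, so the two code paths disagree
whenever the nonzero entries are not exactly 1.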