[SciPy-Dev] computing pairwise distance of vectors with missing (nan) values
Moritz Emanuel Beber
moritz.beber at gmail.com
Mon Jul 21 04:09:02 EDT 2014
Dear all,
My basic problem is that I would like to compute distances between
vectors with missing values. You can find more detail in my question on
SO
(http://stackoverflow.com/questions/24781461/compute-the-pairwise-distance-in-scipy-with-missing-values).
Since this does not seem to be directly possible with SciPy at the
moment, I started to Cythonize my function. Currently, the function
below is not much faster than my pure Python implementation, so I
thought I'd ask the experts here. Note that even though I'm computing
the Euclidean distance here, I'd like to be able to use other distance
metrics as well.
So my current attempt at Cythonizing is:
import numpy
cimport numpy
cimport cython
from numpy.linalg import norm

numpy.import_array()

@cython.boundscheck(False)
@cython.wraparound(False)
def masked_euclidean(numpy.ndarray[numpy.double_t, ndim=2] data):
    cdef Py_ssize_t m = data.shape[0]
    cdef Py_ssize_t i = 0
    cdef Py_ssize_t j = 0
    cdef Py_ssize_t k = 0
    cdef numpy.ndarray[numpy.double_t] dm = numpy.zeros(
        m * (m - 1) // 2, dtype=numpy.double)
    # boolean
    cdef numpy.ndarray[numpy.uint8_t, ndim=2, cast=True] mask = numpy.isfinite(data)
    for i in range(m - 1):
        for j in range(i + 1, m):
            curr = numpy.logical_and(mask[i], mask[j])
            u = data[i][curr]
            v = data[j][curr]
            dm[k] = norm(u - v)
            k += 1
    return dm
Maybe the lack of speed-up is due to the Python function 'norm'? So my
question is, how to improve the Cython implementation? Or is there a
completely different way of approaching this problem?
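One completely different approach, at least for the Euclidean case,
might be to drop the explicit pair loop and express the masked squared
distances as three matrix products. The following is only a minimal
sketch under stated assumptions, not a tested replacement: the name
masked_euclidean_vectorized is hypothetical, it needs O(m^2) memory for
the full square matrix, it can lose precision through cancellation, and
it does not carry over to arbitrary metrics. As in the function above,
a pair of rows with no shared finite coordinates gets a distance of 0.

```python
import numpy


def masked_euclidean_vectorized(data):
    """Condensed pairwise Euclidean distances over shared finite columns.

    Hypothetical helper: for each pair of rows, only the coordinates
    that are finite in both rows contribute to the distance.
    """
    data = numpy.asarray(data, dtype=numpy.double)
    mask = numpy.isfinite(data)
    # Replace missing entries by 0 so that they drop out of every product.
    filled = numpy.where(mask, data, 0.0)
    weights = mask.astype(numpy.double)
    squares = filled * filled
    # sum_k M_ik * M_jk * (x_ik - x_jk)**2, expanded term by term:
    #   sum_k x_ik**2 * M_jk + sum_k M_ik * x_jk**2 - 2 * sum_k x_ik * x_jk
    d2 = (numpy.dot(squares, weights.T)
          + numpy.dot(weights, squares.T)
          - 2.0 * numpy.dot(filled, filled.T))
    # Rounding can leave tiny negative values; clip before the square root.
    numpy.maximum(d2, 0.0, out=d2)
    # Upper triangle in row-major order, i.e. the same (i, j) order as the
    # nested loops above.
    rows, cols = numpy.triu_indices(data.shape[0], k=1)
    return numpy.sqrt(d2[rows, cols])


example = numpy.array([[1.0, 2.0, numpy.nan],
                       [4.0, numpy.nan, 6.0],
                       [1.0, 6.0, 9.0]])
print(masked_euclidean_vectorized(example))
```

The trick relies on the Euclidean distance decomposing into sums of
products, so other metrics would still need a loop-based version.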
Thanks in advance,
Moritz