[SciPy-Dev] computing pairwise distance of vectors with missing (nan) values

Moritz Emanuel Beber moritz.beber at gmail.com
Mon Jul 21 04:09:02 EDT 2014


Dear all,

My basic problem is that I would like to compute pairwise distances between
vectors with missing (NaN) values. You can find more detail in my question on
Stack Overflow
(http://stackoverflow.com/questions/24781461/compute-the-pairwise-distance-in-scipy-with-missing-values).
Since this does not seem to be directly possible with SciPy at the moment, I
started to Cythonize my function. Currently, the function below is not
much faster than my pure Python implementation, so I thought I'd ask the
experts here. *Note that even though I'm computing the Euclidean
distance, I'd like to make use of different distance metrics.*

So my current attempt at Cythonizing is:

import numpy
cimport numpy
cimport cython
from numpy.linalg import norm

numpy.import_array()

@cython.boundscheck(False)
@cython.wraparound(False)
def masked_euclidean(numpy.ndarray[numpy.double_t, ndim=2] data):
     cdef Py_ssize_t m = data.shape[0]
     cdef Py_ssize_t i = 0
     cdef Py_ssize_t j = 0
     cdef Py_ssize_t k = 0
     cdef numpy.ndarray[numpy.double_t] dm = numpy.zeros(m * (m - 1) // 2, dtype=numpy.double)
     cdef numpy.ndarray[numpy.uint8_t, ndim=2, cast=True] mask = numpy.isfinite(data) # boolean
     for i in range(m - 1):
         for j in range(i + 1, m):
             curr = numpy.logical_and(mask[i], mask[j])
             u = data[i][curr]
             v = data[j][curr]
             dm[k] = norm(u - v)
             k += 1
     return dm

Maybe the lack of speed-up is due to the Python-level call to 'norm'? So my
question is: how can I improve the Cython implementation? Or is there a
completely different way of approaching this problem?
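
To make the question concrete, here is a minimal NumPy-only sketch of one
possible "different way", assuming the Euclidean metric specifically. The
function names masked_euclidean_loop and masked_euclidean_vectorized are
hypothetical and not part of SciPy. The first function mirrors the pairwise
loop above but accepts any metric callable; the second replaces non-finite
entries with zeros and recovers all squared distances from three matrix
products. That trick is specific to the (squared) Euclidean distance and
would not carry over unchanged to other metrics.

```python
import numpy as np


def masked_euclidean_loop(data, metric=None):
    """Reference loop: distance over the coordinates finite in both rows.

    `metric` is a hypothetical hook: any callable taking two 1-D arrays.
    """
    if metric is None:
        metric = lambda u, v: np.sqrt(np.sum((u - v) ** 2))
    mask = np.isfinite(data)
    m = data.shape[0]
    dm = np.zeros(m * (m - 1) // 2)
    k = 0
    for i in range(m - 1):
        for j in range(i + 1, m):
            shared = mask[i] & mask[j]  # columns valid in both rows
            dm[k] = metric(data[i, shared], data[j, shared])
            k += 1
    return dm


def masked_euclidean_vectorized(data):
    """Euclidean-only variant without a Python-level loop over pairs."""
    mask = np.isfinite(data)
    w = mask.astype(np.double)       # 1.0 where a value is present
    x = np.where(mask, data, 0.0)    # missing entries contribute nothing
    x2 = x * x
    # sum_k w_ik w_jk (x_ik - x_jk)^2, expanded into three matrix products
    sq = x2 @ w.T + w @ x2.T - 2.0 * (x @ x.T)
    np.maximum(sq, 0.0, out=sq)      # clip tiny negatives from round-off
    rows, cols = np.triu_indices(data.shape[0], k=1)
    return np.sqrt(sq[rows, cols])   # condensed order, as in the loop


data = np.array([[0.0, 0.0, np.nan],
                 [3.0, 4.0, 1.0],
                 [np.nan, 1.0, 1.0]])
print(masked_euclidean_loop(data))
print(masked_euclidean_vectorized(data))
```

Both functions return 0 for a pair of rows that share no finite coordinate,
and the vectorized form can lose some precision to cancellation when two
rows are nearly identical, so it is probably best checked against the loop
on real data before relying on it.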

Thanks in advance,
Moritz