[SciPy-dev] [Numpy-discussion] Proposal: scipy.spatial

Sat Oct 4 08:57:23 EDT 2008

2008/10/4 Prabhu Ramachandran <prabhu at aero.iitb.ac.in>:
> Hi,
>
> Nathan Bell wrote:
>> On Wed, Oct 1, 2008 at 12:53 PM, Prabhu Ramachandran
>> <prabhu at aero.iitb.ac.in> wrote:
>>> I thought this was a discussion on a good API for such an algorithm
>>> (sorry if I misunderstood!).  Given that context, if this is a general
>>> purpose algorithm I'd rather not assume that the number of results is
>>> typically going to be small.
>>
>> However, you *can* assume log(k) of a k-nearest neighbor query is
>> going to be small.
>
> True.
>
>>> If there is no need for sorting and there are a large number of nearest
>>> neighbor queries (in my case there are O(N) nearest neighbor queries),
>>> this quickly adds up to an unnecessary O(N) inefficiency that I'd like
>>> to avoid.
>>
>> Although you might not sort a k-nearest neighbor query, but as Anne
>> has said, you end up needing a sorted structure (like a priority
>> queue) to perform the query itself anyway.
>
> In the context of a kd-tree, yes, but not in other contexts like the "in
> sphere" query you mention below.
>
>> The only case where not sorting makes sense is an "in sphere" query
>> where there's no need to prioritize the results.  In this case, I'd
>> still argue that the overhead of sorting isn't likely to matter.
>
> Well, in the case of the kd-tree I understand that this is not likely to
> matter.  I was under the mistaken impression that this was to be a
> common API.  I re-read the thread and understand now that this is a
> first cut for just the KDTree.
>
> Personally, I have a need for finding the exact nearest neighbors within
> an epsilon ball (in 1, 2, and 3D) of *all* the nodes.  Typically epislon
> is known (or at least bounded) a-priori.  In this context I typically
> build a binary/quad/oct tree and limit the size of the leaves to around
> epsilon. I avoid tree traversals by iterating over the leaves in order
> to get to the points they contain -- note that the target point is
> always contained in the tree which is why I can do this.   Anyway, in
> this context sorting the results ends up in O(N klog(k)) operations if
> you have k neighbors on the average.  This is expensive.
>
> In any case, I am not sure there is a general need for this kind of data
> structure.  I need it a whole lot.  If there is a general need I'll be
> glad to try and extract any of my code and contribute it to scipy.spatial.

There does seem to be a general need for a two-tree query that
extracts (or just counts) all items of one tree within a ball centered
at each item of another tree. You're right, too, that "in a ball"
queries often don't naturally produce sorted results, or even known
distances. But I think this has to be a different query. In fact, even
for single-point queries it may make sense to have a separate query
function for in-a-ball queries.

Anne