[SciPy-user] Faster allclose, comparing arrays

Anne Archibald peridot.faceted at gmail.com
Sat Feb 2 14:36:39 EST 2008


On 01/02/2008, Tom Johnson <tjhnson at gmail.com> wrote:
> Frequently, I am comparing vectors to one another, and I am finding
> that a good portion of my time is spent in allclose.  This is done
> through a class which stores a bunch of metadata and also a 'vector'
> attribute:
>
> def __eq__(self, other):
>       return allclose(self.vector, other.vector, rtol=RTOL, atol=ATOL)
>
> So I place these objects in a list and check equality via "if x in
> objlist".   Couple of questions:
>
> 1) Is this a "good" method for comparing arrays?
> 2) Is there any way to speed up allclose?

That rather depends on what you want "comparing" to do. If you want
exact equality, then allclose is doing the wrong thing; you want
something like N.all(a==b). But I suspect you know that. There are
potentially more subtle problems with allclose(), though.

For a simple example, you can easily have N.allclose(a,b) but not
N.allclose(1e3*a,1e3*b), because the absolute tolerance atol does not
scale with the data.
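For instance, with NumPy's default tolerances (rtol=1e-5 and
atol=1e-8), something like this shows the asymmetry:

import numpy as np

a = np.array([0.0])
b = np.array([1e-9])

# atol dominates near zero, so the comparison is not scale-invariant:
print(np.allclose(a, b))              # True:  1e-9 <= 1e-8 + 1e-5*1e-9
print(np.allclose(1e3 * a, 1e3 * b))  # False: 1e-6 >  1e-8 + 1e-5*1e-6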

For a more subtle example, suppose you want to compare a vector with a
result obtained by Fourier transforming. If your vector is something
like [1,2,3,4], allclose() will do pretty much what you want. But if
your vector is something like [1e40,0,0,0], you might have a problem:
the Fourier transform can be expected to introduce numerical errors in
all the components, of size about machine epsilon times the *largest
component*. Since allclose() does an element-wise comparison, if you
get back [1e40+1,1,1,1], allclose() returns False even though the
result is correct to numerical accuracy. On the other hand, sometimes
the different elements of a vector have wildly differing sizes by
design, so normalizing by the largest component isn't what you want.
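
If the Fourier-transform case is what you have, it's easy to write the
normalized comparison out by hand. A rough sketch (the function name
and tol are just placeholders for whatever error you're willing to
accept relative to the largest entry):

import numpy as np

def close_relative_to_largest(a, b, tol=1e-12):
    # Accept differences that are small compared to the largest entry
    # of either vector, instead of checking each entry on its own scale.
    scale = max(np.abs(a).max(), np.abs(b).max(), 1.0)  # 1.0 guards all-zero input
    return bool(np.all(np.abs(a - b) <= tol * scale))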

I think of allclose() as a debugging function; if I want my code's
result to depend on how close two vectors are, I write out explicitly
what I mean by "close".

How can you go faster? Well, that depends on whether you want
allclose()'s semantics or something else. If you want real equality,
all(a==b) will be slightly faster, but there are various hacks you can
pull off - like making a Python set of .tostring() values (once the
arrays are put in some normalized form). You might be able to
accelerate allclose()-like functions by writing a compiled bit of code
that compares entry-by-entry and bails out as soon as any entry
differs - for many applications that'll bring the cost down close to
that of a single float comparison, on average.
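
Rough sketches of both ideas in plain Python/NumPy follow;
list_of_vectors is a stand-in for your own collection, and the chunked
loop only approximates what a truly compiled early exit would buy you:

import numpy as np

# 1) Exact equality: hash the raw bytes of each array, after forcing a
#    common dtype and contiguous layout (the "normalized form"), and
#    keep them in a Python set for cheap membership tests.
def normalize(a):
    return np.ascontiguousarray(a, dtype=np.float64)

known = set(normalize(v).tostring() for v in list_of_vectors)

def seen_exactly(v):
    return normalize(v).tostring() in known

# 2) allclose()-like comparison with an early exit: work in chunks so
#    that a mismatch near the start avoids touching the rest.
def allclose_early_exit(a, b, rtol=1e-5, atol=1e-8, chunk=1024):
    a = np.ravel(a)
    b = np.ravel(b)
    for i in range(0, a.size, chunk):
        x = a[i:i + chunk]
        y = b[i:i + chunk]
        if not np.all(np.abs(x - y) <= atol + rtol * np.abs(y)):
            return False
    return True

(.tostring() is called .tobytes() in newer NumPy releases.)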

If you have *lots* of vectors, and you're looking for one close enough
to yours, you can do better than simply trying all candidates. This is
the domain of spatial data structures, and it's a huge topic (and my
knowledge is quite out-of-date). But, for a simple example, you can
arrange the vectors to be tested against on the leaves of a tree. Each
node in the tree specifies a coordinate (the 7th, for example) and a
value v; all arrays with a[7]>v go on one branch, and all with a[7]<=v
go on the other. v can be chosen so that half the arrays go on either
side. Then when searching for a test array in the collection, you can
almost always test only one branch.  (This is roughly a kd-tree; you
can find much more information on this topic with a bit of googling.)
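
If your SciPy has scipy.spatial (recent releases do), its cKDTree does
essentially this. Note that it tests Euclidean distance against a
threshold, which is a different notion of "close" than allclose()'s
per-element tolerances. A minimal sketch, with list_of_vectors and eps
as placeholders:

import numpy as np
from scipy.spatial import cKDTree

data = np.vstack(list_of_vectors)   # one row per candidate vector
tree = cKDTree(data)                # build once, query many times

def find_close_index(x, eps=1e-8):
    # Index of the nearest stored vector within Euclidean distance eps,
    # or None if no stored vector is that close.
    dist, idx = tree.query(x, k=1, distance_upper_bound=eps)
    return None if np.isinf(dist) else idx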

Anne


