[Numpy-discussion] numarray.where confusion

Thu May 27 10:48:00 EDT 2004

Francesc Alted wrote:

> On Wednesday 26 May 2004 21:01, Perry Greenfield wrote:
> > correct. You'd have to break apart the m1 tuple and
> > index all the components, e.g.,
> > 
> > m11, m12 = m1
> > x[m11[m2],m12[m2]] = ...
> > 
> > This gets clumsier with the more dimensions that must
> > be handled, but you still can do it. It would be most
> > useful if the indexed array is very large, the number
> > of items selected is relatively small and one
> > doesn't want to incur the memory overhead of all the
> > mask arrays of the admittedly much nicer notational
> > approach that Francesc illustrated.
> 
> Well, boolean arrays have the property that they use very little memory
> (only 1 byte / element), and normally perform quite well doing indexing.
> Some timings:
> 
> >>> import timeit
> >>> t1 = timeit.Timer(
> ...     "m1=where(x>4);m2=where(x[m1]<7);m11,m12=m1;x[m11[m2],m12[m2]]",
> ...     "from numarray import arange,where;dim=3;x=arange(dim*dim);x.shape=(dim,dim)")
> >>> t2 = timeit.Timer(
> ...     "x[(x>4) & (x<7)]",
> ...     "from numarray import arange,where;dim=3;x=arange(dim*dim);x.shape=(dim,dim)")
> >>> t1.repeat(3,1000)
> [3.1320240497589111, 3.1235389709472656, 3.1198310852050781]
> >>> t2.repeat(3,1000)
> [1.1218469142913818, 1.117638111114502, 1.1156759262084961]
> 
> i.e. using boolean arrays for indexing is roughly 3 times faster.
> 
> For larger arrays this difference is even more noticeable:
> 
> >>> t3 = timeit.Timer(
> ...     "m1=where(x>4);m2=where(x[m1]<7);m11,m12=m1;x[m11[m2],m12[m2]]",
> ...     "from numarray import arange,where;dim=1000;x=arange(dim*dim);x.shape=(dim,dim)")
> >>> t4 = timeit.Timer(
> ...     "x[(x>4) & (x<7)]",
> ...     "from numarray import arange,where;dim=1000;x=arange(dim*dim);x.shape=(dim,dim)")
> >>> t3.repeat(3,10)
> [3.1818649768829346, 3.20477294921875, 3.190640926361084]
> >>> t4.repeat(3,10)
> [0.42328095436096191, 0.42140507698059082, 0.41979002952575684]
> 
> as you see, now the difference is almost an order of magnitude (!).
> 
> So, if the small memory overhead is acceptable, in most cases it is
> probably better to use boolean selections. However, it would be nice to
> know the underlying reason why this happens, because Perry's approach
> intuitively seems faster.
>
Yes, I agree. It was good of you to post these timings. I don't
think we had actually compared the two approaches, though the
results don't surprise me. I suspect the results may change
if the first mask selects only a very small percentage of elements;
the large timing test has nearly all elements selected by the first
mask.

Perry
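
The open question above, whether the comparison changes when the first
mask selects only a few elements, can be checked with a small harness.
What follows is only a sketch under stated assumptions: it uses NumPy
in place of numarray (numpy.where and boolean-mask indexing are assumed
to behave the same way for this purpose), the function names are made up
for illustration, and the thresholds for the sparse case are arbitrary.
No timings are quoted, since they depend on the machine and the library
version.

```python
import timeit

import numpy as np  # assumption: NumPy used in place of numarray


def select_with_index_tuples(x, lo, hi):
    """Two-stage selection with where(), breaking the index tuple apart."""
    m1 = np.where(x > lo)        # tuple of index arrays, one per dimension
    m2 = np.where(x[m1] < hi)    # positions within the first selection
    m11, m12 = m1                # 2-D case only
    return x[m11[m2], m12[m2]]


def select_with_boolean_mask(x, lo, hi):
    """Single-stage selection with a combined boolean mask."""
    return x[(x > lo) & (x < hi)]


def time_both(x, lo, hi, number=10, repeat=3):
    """Return the best-of-`repeat` time in seconds for each approach."""
    t_index = min(timeit.repeat(lambda: select_with_index_tuples(x, lo, hi),
                                number=number, repeat=repeat))
    t_mask = min(timeit.repeat(lambda: select_with_boolean_mask(x, lo, hi),
                               number=number, repeat=repeat))
    return t_index, t_mask


dim = 1000
x = np.arange(dim * dim).reshape(dim, dim)

# Dense first mask, as in the quoted timings: x > 4 keeps nearly everything.
dense = time_both(x, 4, 7)

# Sparse first mask (hypothetical thresholds): x > x.size - 10 keeps 9 elements.
sparse = time_both(x, x.size - 10, x.size - 5)

print("dense  first mask: index tuples %.4fs, boolean mask %.4fs" % dense)
print("sparse first mask: index tuples %.4fs, boolean mask %.4fs" % sparse)
```

Both functions return the same elements in the same (row-major) order,
so any difference is purely one of time and temporary memory. The
boolean approach always builds full-size masks, while the index-tuple
approach builds index arrays whose size depends on how many elements
the first mask keeps.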