looping and searching in numpy array

Mon Mar 14 11:22:30 EDT 2016

On 10 March 2016 at 13:02, Peter Otten <__peter__ at web.de> wrote:
> Heli wrote:
>
>> I need to loop over a numpy array and then do the following search. The
>> following is taking almost 60(s) for an array (npArray1 and npArray2 in
>> the example below) with around 300K values.
>>
>>
>> for id in np.nditer(npArray1):
>>        newId=(np.where(npArray2==id))[0][0]

What are the dtypes of the arrays? And what are the typical sizes of
each of them. It can have a big effect on what makes a good solution
to the problem.

>> Is there anyway I can make the above faster? I need to run the script
>> above on much bigger arrays (50M). Please note that my two numpy arrays in
>> the lines above, npArray1 and npArray2  are not necessarily the same size,
>> but they are both 1d.
>
> You mean you are looking for the index of the first occurence in npArray2
> for every value of npArray1?
>
> I don't know how to do this in numpy (I'm not an expert), but even basic
> Python might be acceptable:

I'm not sure that numpy has any particular function that can be of use
here. Your approach below looks good though.

> lookup = {}
> for i, v in enumerate(npArray2):
>     if v not in lookup:
>         lookup[v] = i

Looking at this I wondered if there was a way to avoid the double hash
table lookup and realised it's the first time I've ever considered a
use for setdefault:

for i, v in enumerate(npArray2):
     lookup.setdefault(i, v)

Another option would be to use this same algorithm in Cython. Then you
can access the ndarray data pointer directly and loop over it in C.
This is the kind of scenario where that sort of thing can be well
worth doing.

--
Oscar