[Numpy-discussion] Implementing a "find first" style function

Phil Elson pelson.pub at gmail.com
Tue Mar 26 05:20:34 EDT 2013


Bump.

I'd be interested to know whether this is a desirable feature for numpy
(specifically the 1D "find" functionality, rather than the "any"/"all"
variants also discussed).
If so, I'd be more than happy to submit a PR, but I don't want to put in
the effort if the principle isn't wanted in the core of numpy.

Cheers,



On 8 March 2013 17:38, Phil Elson <pelson.pub at gmail.com> wrote:

> Interesting. I hadn't thought of those. I've implemented (very roughly
> without a sound logic check) and benchmarked:
>
> def my_any(a, predicate, chunk_size=2048):
>     # `find` is the chunked generator from my comment on #2269
>     # (described in the quoted message below); it lazily yields
>     # ((index,), value) pairs for matching elements.
>     try:
>         next(find(a, predicate, chunk_size))
>         return True
>     except StopIteration:
>         return False
>
> def my_all(a, predicate, chunk_size=2048):
>     # All elements satisfy the predicate iff none satisfies its
>     # negation (~ inverts the boolean array element-wise).
>     return not my_any(a, lambda a: ~predicate(a), chunk_size)
>
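> A quick sanity check, separate from the benchmarks below: both helpers
> should agree with the built-ins on any input, e.g.
>
> assert my_any(a, lambda x: np.abs(x) > 6) == bool((np.abs(a) > 6).any())
> assert my_all(a, lambda x: np.abs(x) < 6) == bool((np.abs(a) < 6).all())
>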
>
> With the following setup:
>
> import numpy as np
> import numpy.random
>
> np.random.seed(1)
> a = np.random.randn(10**8)  # an int size; newer numpy rejects 1e8 here
>
>
> For a low frequency *any*:
>
> In [12]: %timeit (np.abs(a) > 6).any()
> 1 loops, best of 3: 1.29 s per loop
>
> In [13]: %timeit my_any(a, lambda a: np.abs(a) > 6)
> 1 loops, best of 3: 792 ms per loop
>
> In [14]: %timeit my_any(a, lambda a: np.abs(a) > 6, chunk_size=10000)
> 1 loops, best of 3: 654 ms per loop
>
> For a False *any*:
>
> In [16]: %timeit (np.abs(a) > 7).any()
> 1 loops, best of 3: 1.22 s per loop
>
> In [17]: %timeit my_any(a, lambda a: np.abs(a) > 7)
> 1 loops, best of 3: 2.4 s per loop
>
> For a high probability *any*:
>
> In [28]: %timeit (np.abs(a) > 1).any()
> 1 loops, best of 3: 972 ms per loop
>
> In [27]: %timeit my_any(a, lambda a: np.abs(a) > 1)
> 10000 loops, best of 3: 67 us per loop
>
> ---------------
>
> For a low probability *all*:
>
> In [18]: %timeit (np.abs(a) < 6).all()
> 1 loops, best of 3: 1.16 s per loop
>
> In [19]: %timeit my_all(a, lambda a: np.abs(a) < 6)
> 1 loops, best of 3: 880 ms per loop
>
> In [20]: %timeit my_all(a, lambda a: np.abs(a) < 6, chunk_size=10000)
> 1 loops, best of 3: 706 ms per loop
>
> For a True *all*:
>
> In [22]: %timeit (np.abs(a) < 7).all()
> 1 loops, best of 3: 1.47 s per loop
>
> In [23]: %timeit my_all(a, lambda a: np.abs(a) < 7)
> 1 loops, best of 3: 2.65 s per loop
>
> For a high probability *all*:
>
> In [25]: %timeit (np.abs(a) < 1).all()
> 1 loops, best of 3: 978 ms per loop
>
> In [26]: %timeit my_all(a, lambda a: np.abs(a) < 1)
> 10000 loops, best of 3: 73.6 us per loop
>
> On 6 March 2013 21:16, Benjamin Root <ben.root at ou.edu> wrote:
>
>>
>>
>> On Tue, Mar 5, 2013 at 9:15 AM, Phil Elson <pelson.pub at gmail.com> wrote:
>>
>>> The ticket https://github.com/numpy/numpy/issues/2269 discusses the
>>> possibility of implementing a "find first" style function which can
>>> optimise the process of finding the first value(s) which match a predicate
>>> in a given 1D array. For example:
>>>
>>>
>>> >>> a = np.sin(np.linspace(0, np.pi, 200))
>>> >>> print find_first(a, lambda a: a > 0.9)
>>> ((71, ), 0.900479032457)
>>>
>>>
>>> This has been discussed in several locations:
>>>
>>> https://github.com/numpy/numpy/issues/2269
>>> https://github.com/numpy/numpy/issues/2333
>>>
>>> http://stackoverflow.com/questions/7632963/numpy-array-how-to-find-index-of-first-occurrence-of-item
>>>
>>>
>>> *Rationale*
>>>
>>> For small arrays there is no real reason to avoid doing:
>>>
>>> >>> a = np.sin(np.linspace(0, np.pi, 200))
>>> >>> ind = (a > 0.9).nonzero()[0][0]
>>> >>> print (ind, ), a[ind]
>>> (71,) 0.900479032457
>>>
>>>
>>> But for larger arrays, this can lead to massive amounts of work even if
>>> the result is one of the first to be computed. Example:
>>>
>>> >>> a = np.arange(1e8)
>>> >>> print (a == 5).nonzero()[0][0]
>>> 5
>>>
>>>
>>> So a function which terminates when the first matching value is found is
>>> desirable.
>>>
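>>> That is, something with the semantics of the following naive scan
>>> (a hypothetical reference implementation, far too slow in practice
>>> because it applies the predicate one element at a time):
>>>
>>> def find_first_naive(a, predicate):
>>>     for i, value in enumerate(a):
>>>         if predicate(value):
>>>             return (i,), value
>>>     return None
>>>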
>>> As mentioned in #2269, it is possible to define a consistent ordering
>>> which allows this functionality for >1D arrays, but IMHO it overcomplicates
>>> the problem and was not a case that I personally needed, so I've limited
>>> the scope to 1D arrays only.
>>>
>>>
>>> *Implementation*
>>>
>>> My initial assumption was that to get any kind of performance I would
>>> need to write the *find* function in C; however, after prototyping with
>>> some array chunking it became apparent that a trivial Python function
>>> would be quick enough for my needs.
>>>
>>> The approach I've implemented in the code found in #2269 simply breaks
>>> the array into sub-arrays of maximum length *chunk_size* (2048 by
>>> default, though there is no real science to this number), applies the
>>> given predicate to each sub-array, and yields the results from
>>> *nonzero()*. The predicate should be a Python function which operates on
>>> the whole of the sub-array element-wise (i.e. it should be vectorized).
>>> Returning a generator also has the benefit of allowing users to get the
>>> first *n* matching values/indices; a sketch follows.
>>>
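>>> A minimal sketch of that approach (assuming the version in the #2269
>>> comment differs only in detail):
>>>
>>> def find(a, predicate, chunk_size=2048):
>>>     # Yield ((index,), value) pairs, in order, for the elements of
>>>     # the 1D array `a` that satisfy `predicate`.
>>>     for start in range(0, len(a), chunk_size):
>>>         chunk = a[start:start + chunk_size]
>>>         # Vectorized predicate over the chunk, then map the local
>>>         # match indices back to positions in the full array.
>>>         for idx in predicate(chunk).nonzero()[0]:
>>>             yield (start + idx,), a[start + idx]
>>>
>>> Being a generator, taking the first *n* matches is just, e.g.:
>>>
>>> from itertools import islice
>>> first_three = list(islice(find(a, lambda x: np.abs(x) > 3), 3))
>>>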
>>>
>>> *Results*
>>>
>>>
>>> I timed the implementation of *find* found in my comment at
>>> https://github.com/numpy/numpy/issues/2269#issuecomment-14436725 with
>>> an obvious test:
>>>
>>>
>>> In [1]: from np_utils import find
>>>
>>> In [2]: import numpy as np
>>>
>>> In [3]: import numpy.random
>>>
>>> In [4]: np.random.seed(1)
>>>
>>> In [5]: a = np.random.randn(10**8)
>>>
>>> In [6]: a.min(), a.max()
>>> Out[6]: (-6.1194900990552776, 5.9632246301166321)
>>>
>>> In [7]: next(find(a, lambda a: np.abs(a) > 6))
>>> Out[7]: ((33105441,), -6.1194900990552776)
>>>
>>> In [8]: (np.abs(a) > 6).nonzero()
>>> Out[8]: (array([33105441]),)
>>>
>>> In [9]: %timeit (np.abs(a) > 6).nonzero()
>>> 1 loops, best of 3: 1.51 s per loop
>>>
>>> In [10]: %timeit next(find(a, lambda a: np.abs(a) > 6))
>>> 1 loops, best of 3: 912 ms per loop
>>>
>>> In [11]: %timeit next(find(a, lambda a: np.abs(a) > 6, chunk_size=100000))
>>> 1 loops, best of 3: 470 ms per loop
>>>
>>> In [12]: %timeit next(find(a, lambda a: np.abs(a) > 6, chunk_size=1000000))
>>> 1 loops, best of 3: 483 ms per loop
>>>
>>>
>>> This shows that picking a sensible *chunk_size* can yield large
>>> speed-ups (*nonzero()* is roughly 3x slower in the best case here). A
>>> similar example with a much smaller 1D array shows similar promise:
>>>
>>> In [41]: a = np.random.randn(10**4)
>>>
>>> In [42]: %timeit next(find(a, lambda a: np.abs(a) > 3))
>>> 10000 loops, best of 3: 35.8 us per loop
>>>
>>> In [43]: %timeit (np.abs(a) > 3).nonzero()
>>> 10000 loops, best of 3: 148 us per loop
>>>
>>>
>>> As I commented on the issue tracker, if you think this function is worth
>>> taking forward, I'd be happy to open up a pull request.
>>>
>>> Feedback gratefully received.
>>>
>>> Cheers,
>>>
>>> Phil
>>>
>>>
>>>
>> In the interest of generalizing code and such, could such approaches be
>> used for functions like np.any() and np.all() to short-circuit as soon as
>> a True or False (respectively) is found? I wonder what other sorts of
>> functions in NumPy might benefit from this?
>>
>> Ben Root
>>
>>
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion at scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>
>>
>