[Numpy-discussion] Proposal: np.search() to complement np.searchsorted()

Mon May 15 16:59:28 EDT 2017

On May 9, 2017 9:47 AM, "Martin Spacek" <numpy at mspacek.mm.st> wrote:

Hello,

I've opened up a pull request to add a function called np.search(), or
something like it, to complement np.searchsorted():

https://github.com/numpy/numpy/pull/9055

There's also this issue I opened before starting the PR:

https://github.com/numpy/numpy/issues/9052

Proposed API changes require discussion on the list, so here I am!

This proposed function (and perhaps array method?) does the same as
np.searchsorted(a, v), but doesn't require `a` to be sorted, and explicitly
checks if all the values in `v` are a subset of those in `a`. If not, it
currently raises an error, but that could be controlled via a kwarg.

As I mentioned in the PR, I often find myself abusing np.searchsorted() by
not explicitly checking these assumptions. The temptation to use it is
great, because it's such a fast and convenient function, and most of the
time that I use it, the assumptions are indeed valid. Explicitly checking
those assumptions each and every time before I use np.searchsorted() is
tedious, and easy to forget to do. I wouldn't be surprised if many others
abuse np.searchsorted() in the same way.

It's worth noting though that the "sorted" part is a critical part of what
makes it fast. If we're looking for k needles in an n-item haystack, then:

If the haystack is already sorted and we know it, using searchsorted does
it in k*log2(n) comparisons. (Could be reduced to average case O(k log log
n) for simple scalars by using interpolation search, but I don't think
searchsorted is that clever atm.)

If the haystack is not sorted, then sorting it and then using searchsorted
requires a total of O(n log n) + k*log2(n) comparisons.

And if the haystack is not sorted, then doing linear search to find the
first item like list.index does requires on average 0.5*k*n comparisons.

This analysis ignores memory effects, which are important -- linear memory
access is faster than random access, and searchsorted is all about making
memory access maximally unpredictable. But even so, I think
sorting-then-searching will be reasonably competitive pretty much from the
start, and for moderately large k and n values the difference between (n +
k)*log(n) and n*k is huge.

Another issue is that sorting requires an O(n)-sized temporary buffer
(assuming you can't mutate the haystack in place). But if your haystack is
a large enough fraction of memory that you can't afford is buffer, then
it's likely large enough that you can't afford linear searching either...

Looking at my own habits and uses, it seems to me that finding the indices
of matching values of one array in another is a more common use case than
finding insertion indices of one array into another sorted array. So, I
propose that np.search(), or something like it, could be even more useful
than np.searchsorted().

My main concern here would be creating a trap for the unwary, where people
use search() naively because it's so nice and convenient, and then
eventually get surprised by a nasty quadratic slowdown. There's a whole
blog about these traps :-) https://accidentallyquadratic.tumblr.com/

Otoh there are also huge number of numpy use cases where it doesn't matter
if some calculation is 1000x slower than it should be, as long as it works
and is discoverable...

So it sounds like one obvious thing would be to have a version of
searchsorted that checks for matches (maybe side="exact"? Though that's not
easy to find...). That's clearly useful, and orthogonal to the
linear/binary search issue, so we shouldn't make it a reason people are
tempted to choose the inferior algorithm.

...ok, how's this for a suggestion. Give np.search a strategy= kwarg, with
options "linear", "searchsorted", and "auto". Linear does the obvious
thing, searchsorted generates a sorter array using argsort (unless the user
provided one) and then calls searchsorted, and auto picks one of them
depending on whether a sorter array was provided and how large the arrays
are. The default is auto. In all cases it looks for exact matches.

I guess by default "not found" should be signaled with an exception, and
then there should be some option to have it return a sentinel value
instead? The problem is that since we're returning integers then there's no
sentinel value that's necessarily an invalid index (e.g. indexing will
happily accept -1).

-n
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20170515/8a5ef7d0/attachment.html>