[Numpy-discussion] suggested change of behavior for interp

Wed Jun 5 13:48:21 EDT 2013

On Wed, Jun 5, 2013 at 6:08 PM, Julian Taylor
<jtaylor.debian at googlemail.com> wrote:
> On 05.06.2013 16:33, Nathaniel Smith wrote:
>> The slow down people are worried about is, suppose that 'xp' has
>> 1,000,000 entries, and the user wants to interpolate 1 point. If we
>> can assume the array is sorted, then we can find which bin the 1 point
>> falls into by using binary search (np.searchsorted), which will
>> require examining ~log2(1,000,000) = 20 entries in the array. Checking
>> that it's sorted, though, will require examining all 1,000,000 points
>> -- it turns an O(log n) operation into an O(n) operation. And if this
>> is being called as part of some larger algorithm, this effect can
>> cascade -- if it gets called m times from a loop, now that's O(mn)
>> instead of (m log n), etc. That's often the difference between
>> tractable and intractable.
>>
>> If you submit a PR that gives interp the option to check for
>> monotonicity, then we'll almost certainly merge it :-). If you also
>> make it the default then there might be some debate, though I'd argue
>> for it...
>
> I would not like it when the performance goes from log to linear by default.
> It is documented that the coordinates must be increasing after all.

I agree, I don't like it either. But the problem is there are two
contradictory things and I don't like either of them:
1) performance going from log to linear by default (for a subset of
somewhat extreme cases)
2) people silently getting the wrong results (and according to reports
in this thread, the warning in the docs does not actually prevent
this)

The question is which one do we dislike more. My feeling is that in
the cases where (1) comes up, it will annoy people and get them to
check the docs and find the go-faster switch, while the first warning
of (2) is when your spaceship crashes, so we should worry more about
(2).

> How about as a compromise the function checks one or two closest
> neighbors around the interpolation point and warns/errors if those are
> not monotonic.
>
> Its not fool proof but should at least catch the very wrong cases with
> an acceptable performance loss.

There are tons of ways for data to accidentally end up sorted within
each local region, but unsorted overall, e.g., if you're combining
results from parallel simulation runs. Such data would systematically
pass this check, but still give nonsensical answers.

-n