[Numpy-discussion] Does a `mergesorted` function make sense?

Wed Sep 3 09:41:51 EDT 2014

On Wed, Sep 3, 2014 at 4:07 AM, Jaime Fernández del Río <
jaime.frio at gmail.com> wrote:

> On Tue, Sep 2, 2014 at 5:40 PM, Charles R Harris <
> charlesr.harris at gmail.com> wrote:
>>
>>
>> What do you think about the suggestion of timsort? One would need to
>> concatenate the arrays before sorting, but it should be fairly efficient.
>>
>
> Timsort is very cool, and it would definitely be fun to implement in
> numpy. It is also a lot more work that merging two sorted arrays! I think
> +1 if someone else does it, but although I would love to be part of it, I
> am not sure I will be able to find time to look into it seriously in the
> next couple of months.
>
> From a setops point of view, merging two sorted arrays makes it very
> straightforward to compute, together with (or instead of) the result of the
> merge, index arrays that let you calculate things like `in1d` faster.
> Although perhaps an `argtimsort` could provide the same functionality, not
> sure. I will probably wrap up what I have, put a lace on it, and submit it
> as a PR. Even if it is not destined to be merged, it may serve as a warning
> to others.
>
> I have also been thinking lately that one of the problems with all these
> unique-related computations may be a case of having a hammer and seeing
> everything as nails. Perhaps the algorithm that needs to be ported from
> Python is not the sorting one, but the hash table...
>
> Jaime
>
>
 Not sure about the hashing. Indeed one can also build an index of a set by
means of a hash table, but its questionable if this leads to improved
performance over performing an argsort. Hashing may have better asymptotic
time complexity in theory, but many datasets used in practice are very easy
to sort (O(N)-ish), and the time-constant of hashing is higher. But more
importantly, using a hash-table guarantees poor cache behavior for many
operations using this index. By contrast, sorting may (but need not) make
one random access pass to build the index, and may (but need not) perform
one random access to reorder values for grouping. But insofar as the keys
are better behaved than pure random, this coherence will be exploited.

Also, getting the unique values/keys in sorted order is a nice side-benefit
for many applications.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20140903/aaae1984/attachment.html>