[Numpy-discussion] Does a `mergesorted` function make sense?

Jaime Fernández del Río jaime.frio at gmail.com
Wed Sep 3 12:33:02 EDT 2014


On Wed, Sep 3, 2014 at 6:41 AM, Eelco Hoogendoorn <
hoogendoorn.eelco at gmail.com> wrote:

>  Not sure about the hashing. Indeed one can also build an index of a set
> by means of a hash table, but it's questionable whether this leads to improved
> performance over performing an argsort. Hashing may have better asymptotic
> time complexity in theory, but many datasets used in practice are very easy
> to sort (O(N)-ish), and the time-constant of hashing is higher. But more
> importantly, using a hash-table guarantees poor cache behavior for many
> operations using this index. By contrast, sorting may (but need not) make
> one random access pass to build the index, and may (but need not) perform
> one random access to reorder values for grouping. But insofar as the keys
> are better behaved than pure random, this coherence will be exploited.
>
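
For concreteness, the sort-based grouping described above can be sketched
roughly like this (a minimal NumPy sketch with made-up toy data, not code
from either implementation):

import numpy as np

keys = np.array([3, 1, 3, 2, 1, 3])        # keys to group by
values = np.arange(len(keys))              # associated values

# Build the index with a single argsort, then one pass to reorder.
order = np.argsort(keys, kind='mergesort')  # stable sort
sorted_keys = keys[order]
sorted_values = values[order]

# Group boundaries are where consecutive sorted keys differ.
boundaries = np.flatnonzero(np.diff(sorted_keys)) + 1
groups = np.split(sorted_values, boundaries)
unique_keys = sorted_keys[np.concatenate(([0], boundaries))]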

If you want to give it a try, this branch of my numpy fork has hash table
based implementations of unique (with no extra indices) and in1d:

https://github.com/jaimefrio/numpy/tree/hash-unique

A use case where the hash table is clearly better:

In [1]: import numpy as np
In [2]: from numpy.lib._compiled_base import _unique, _in1d

In [3]: a = np.random.randint(10, size=(10000,))
In [4]: %timeit np.unique(a)
1000 loops, best of 3: 258 us per loop
In [5]: %timeit _unique(a)
10000 loops, best of 3: 143 us per loop
In [6]: %timeit np.sort(_unique(a))
10000 loops, best of 3: 149 us per loop

It typically performs between 1.5x and 4x faster than sorting. I haven't
profiled it properly, but there may still be quite a bit of performance to
dig out: type-specific comparison functions, choosing the starting hash
table size based on the size of the array to avoid reinsertions...
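
As a rough illustration of the presizing idea (a hypothetical heuristic,
not what the branch currently does):

def initial_capacity(n, max_load=0.5):
    # Pick the next power of two large enough to keep the load factor
    # below max_load even if every input item is unique, so the table
    # never has to grow and reinsert its contents.
    cap = 1
    while cap * max_load < n:
        cap *= 2
    return cap

initial_capacity(10000)   # -> 32768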

If getting the elements sorted is a necessity, and the array contains very
few or no repeated items, then the hash table approach may even perform
worse:

In [8]: a = np.random.randint(10000, size=(5000,))
In [9]: %timeit np.unique(a)
1000 loops, best of 3: 277 us per loop
In [10]: %timeit np.sort(_unique(a))
1000 loops, best of 3: 320 us per loop

But the hash table still wins in extracting the unique items only:

In [11]: %timeit _unique(a)
10000 loops, best of 3: 187 us per loop

Where the hash table shines is in more elaborate situations. If you keep
the first index where it was found, and the number of repeats, in the hash
table, you can get return_index and return_counts almost for free, which
means those cases run roughly a further 3x faster than with sorting.
return_inverse requires a little more trickery, so I won't attempt to
quantify the improvement. But I wouldn't be surprised if, after fine tuning
it, there is close to an order of magnitude overall improvement.
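
To make the idea concrete, here is a pure-Python sketch of that single pass
(the branch does this in C; the names are only illustrative):

import numpy as np

def unique_with_extras(a):
    # value -> (first index where it was seen, number of occurrences)
    table = {}
    for i, x in enumerate(a.tolist()):
        if x in table:
            first, count = table[x]
            table[x] = (first, count + 1)
        else:
            table[x] = (i, 1)
    uniq = np.fromiter(table.keys(), dtype=a.dtype, count=len(table))
    index = np.fromiter((v[0] for v in table.values()), dtype=np.intp)
    counts = np.fromiter((v[1] for v in table.values()), dtype=np.intp)
    return uniq, index, counts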

The speed-up for in1d is also nice:

In [16]: a = np.random.randint(1000, size=(1000,))
In [17]: b = np.random.randint(1000, size=(500,))
In [18]: %timeit np.in1d(a, b)
1000 loops, best of 3: 178 us per loop
In [19]: %timeit _in1d(a, b)
10000 loops, best of 3: 30.1 us per loop
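
The underlying idea is simply membership testing against a hashed copy of
b, roughly like this (a pure-Python sketch, not the C implementation in
the branch):

import numpy as np

def in1d_hash(a, b):
    # Build a hash set from b, then test each element of a against it.
    lookup = set(b.tolist())
    return np.fromiter((x in lookup for x in a.tolist()),
                       dtype=bool, count=len(a))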

Of course, there is no point in