[Numpy-discussion] Does a `mergesorted` function make sense?

Jeff Reback jeffreback at gmail.com
Thu Sep 4 14:36:58 EDT 2014


FYI, pandas DOES use a very performant hash table implementation for unique (and
value_counts). Sorted state IS maintained by the underlying Index
implementation.
https://github.com/pydata/pandas/blob/master/pandas/hashtable.pyx

In [8]: a = np.random.randint(10, size=(10000,))

In [9]: %timeit np.unique(a)
1000 loops, best of 3: 284 µs per loop

In [10]: %timeit Series(a).unique()
10000 loops, best of 3: 161 µs per loop

In [11]: s = Series(a)

# without the creation overhead
In [12]: %timeit s.unique()
10000 loops, best of 3: 75.3 µs per loop



On Thu, Sep 4, 2014 at 2:29 PM, Eelco Hoogendoorn <
hoogendoorn.eelco at gmail.com> wrote:

>
> On Thu, Sep 4, 2014 at 8:14 PM, Eelco Hoogendoorn <
> hoogendoorn.eelco at gmail.com> wrote:
>
>> I should clarify: I am speaking about my implementation; I haven't looked
>> at the numpy implementation for a while, so I'm not sure what it is up to.
>> Note that by 'almost free', we are still talking about three passes over
>> the whole array plus temp allocations, but I am assuming a use case where
>> the various sorts involved are the dominant cost, which I imagine they are
>> for out-of-cache sorts. Perhaps this isn't too realistic an assumption
>> about the average use case, though; I don't know. Still, I suppose it's a
>> reasonable guideline to assume that either the dataset is big, or
>> performance isn't that big a concern in the first place.
>>
>>
>> On Thu, Sep 4, 2014 at 7:55 PM, Jaime Fernández del Río <
>> jaime.frio at gmail.com> wrote:
>>
>>> On Thu, Sep 4, 2014 at 10:39 AM, Eelco Hoogendoorn <
>>> hoogendoorn.eelco at gmail.com> wrote:
>>>
>>>>
>>>> On Thu, Sep 4, 2014 at 10:31 AM, Eelco Hoogendoorn <
>>>> hoogendoorn.eelco at gmail.com> wrote:
>>>>
>>>>>
>>>>> On Wed, Sep 3, 2014 at 6:46 PM, Jaime Fernández del Río <
>>>>> jaime.frio at gmail.com> wrote:
>>>>>
>>>>>>  On Wed, Sep 3, 2014 at 9:33 AM, Jaime Fernández del Río <
>>>>>> jaime.frio at gmail.com> wrote:
>>>>>>
>>>>>>> On Wed, Sep 3, 2014 at 6:41 AM, Eelco Hoogendoorn <
>>>>>>> hoogendoorn.eelco at gmail.com> wrote:
>>>>>>>
>>>>>>>> Not sure about the hashing. Indeed one can also build an index of
>>>>>>>> a set by means of a hash table, but it's questionable whether this leads to
>>>>>>>> improved performance over performing an argsort. Hashing may have better
>>>>>>>> asymptotic time complexity in theory, but many datasets used in practice
>>>>>>>> are very easy to sort (O(N)-ish), and the time constant of hashing is
>>>>>>>> higher. More importantly, using a hash table guarantees poor cache
>>>>>>>> behavior for many operations using this index. By contrast, sorting may
>>>>>>>> (but need not) make one random-access pass to build the index, and may (but
>>>>>>>> need not) perform one random-access pass to reorder values for grouping. But
>>>>>>>> insofar as the keys are better behaved than purely random, this coherence
>>>>>>>> will be exploited.
>>>>>>>>
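>>>>>>>> To make the sort-based indexing concrete, here is a rough, untested
>>>>>>>> sketch of the kind of index I have in mind (all names here are just
>>>>>>>> for illustration):
>>>>>>>>
>>>>>>>> import numpy as np
>>>>>>>>
>>>>>>>> def sort_index(keys):
>>>>>>>>     # one argsort builds the index: a permutation that sorts the keys,
>>>>>>>>     # plus a flag marking the start of each run of equal keys
>>>>>>>>     sorter = np.argsort(keys, kind='mergesort')
>>>>>>>>     sorted_keys = keys[sorter]
>>>>>>>>     flag = np.concatenate(([True], sorted_keys[1:] != sorted_keys[:-1]))
>>>>>>>>     return sorter, sorted_keys, flag
>>>>>>>>
>>>>>>>> keys = np.random.randint(10, size=1000)
>>>>>>>> sorter, sorted_keys, flag = sort_index(keys)
>>>>>>>> unique = sorted_keys[flag]                     # unique values fall out directly
>>>>>>>> starts = np.flatnonzero(np.append(flag, True))
>>>>>>>> counts = np.diff(starts)                       # run lengths = counts per unique key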
>>>>>>>
>>>>>>> If you want to give it a try, this branch of my numpy fork has
>>>>>>> hash-table-based implementations of unique (with no extra indices) and in1d:
>>>>>>>
>>>>>>>  https://github.com/jaimefrio/numpy/tree/hash-unique
>>>>>>>
>>>>>>> A use case where the hash table is clearly better:
>>>>>>>
>>>>>>> In [1]: import numpy as np
>>>>>>> In [2]: from numpy.lib._compiled_base import _unique, _in1d
>>>>>>>
>>>>>>> In [3]: a = np.random.randint(10, size=(10000,))
>>>>>>> In [4]: %timeit np.unique(a)
>>>>>>> 1000 loops, best of 3: 258 us per loop
>>>>>>> In [5]: %timeit _unique(a)
>>>>>>> 10000 loops, best of 3: 143 us per loop
>>>>>>> In [6]: %timeit np.sort(_unique(a))
>>>>>>> 10000 loops, best of 3: 149 us per loop
>>>>>>>
>>>>>>> It typically performs between 1.5x and 4x faster than sorting. I
>>>>>>> haven't profiled it properly, but there may be quite a bit of
>>>>>>> performance left to dig out: use type-specific comparison functions, optimize
>>>>>>> the starting hash table size based on the size of the array to avoid
>>>>>>> reinsertions...
>>>>>>>
>>>>>>> If getting the elements sorted is a necessity, and the array
>>>>>>> contains very few or no repeated items, then the hash table approach may
>>>>>>> even perform worse:
>>>>>>>
>>>>>>> In [8]: a = np.random.randint(10000, size=(5000,))
>>>>>>> In [9]: %timeit np.unique(a)
>>>>>>> 1000 loops, best of 3: 277 us per loop
>>>>>>> In [10]: %timeit np.sort(_unique(a))
>>>>>>> 1000 loops, best of 3: 320 us per loop
>>>>>>>
>>>>>>> But the hash table still wins in extracting the unique items only:
>>>>>>>
>>>>>>> In [11]: %timeit _unique(a)
>>>>>>> 10000 loops, best of 3: 187 us per loop
>>>>>>>
>>>>>>> Where the hash table shines is in more elaborate situations. If you
>>>>>>> keep the first index where each item was found, and the number of repeats, in the
>>>>>>> hash table, you can get return_index and return_counts almost for free,
>>>>>>> which means you end up roughly another 3x faster than with sorting.
>>>>>>> return_inverse requires a little more trickery, so I won't attempt to
>>>>>>> quantify the improvement. But I wouldn't be surprised if, after fine-tuning
>>>>>>> it, there is close to an order of magnitude overall improvement.
>>>>>>>
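>>>>>>> A pure-Python sketch of that idea, just to illustrate what the hash
>>>>>>> table would keep track of (the real thing lives in C, and these names
>>>>>>> are made up):
>>>>>>>
>>>>>>> import numpy as np
>>>>>>>
>>>>>>> def unique_index_counts(a):
>>>>>>>     # single pass: per key, remember the first index seen and a running count
>>>>>>>     table = {}
>>>>>>>     for i, x in enumerate(a.tolist()):
>>>>>>>         entry = table.get(x)
>>>>>>>         if entry is None:
>>>>>>>             table[x] = [i, 1]
>>>>>>>         else:
>>>>>>>             entry[1] += 1
>>>>>>>     uniq = np.fromiter(table, dtype=a.dtype, count=len(table))
>>>>>>>     first = np.array([v[0] for v in table.values()], dtype=np.intp)
>>>>>>>     counts = np.array([v[1] for v in table.values()], dtype=np.intp)
>>>>>>>     return uniq, first, counts
>>>>>>>
>>>>>>> a = np.random.randint(10, size=(10000,))
>>>>>>> uniq, first, counts = unique_index_counts(a)
>>>>>>>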
>>>>>>> The speed-up for in1d is also nice:
>>>>>>>
>>>>>>> In [16]: a = np.random.randint(1000, size=(1000,))
>>>>>>> In [17]: b = np.random.randint(1000, size=(500,))
>>>>>>> In [18]: %timeit np.in1d(a, b)
>>>>>>> 1000 loops, best of 3: 178 us per loop
>>>>>>> In [19]: %timeit _in1d(a, b)
>>>>>>> 10000 loops, best of 3: 30.1 us per loop
>>>>>>>
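>>>>>>> In spirit, the hash-based in1d is doing the equivalent of the
>>>>>>> following, just in C and with a proper hash table (a sketch for
>>>>>>> illustration only):
>>>>>>>
>>>>>>> import numpy as np
>>>>>>>
>>>>>>> def in1d_sketch(a, b):
>>>>>>>     # hash b once, then a single lookup per element of a
>>>>>>>     table = set(b.tolist())
>>>>>>>     return np.fromiter((x in table for x in a.tolist()),
>>>>>>>                        dtype=bool, count=len(a))
>>>>>>>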
>>>>>>> Of course, there is no point in
>>>>>>>
>>>>>>
>>>>>> Ooops!!! Hit the send button too quick. Not to extend myself too
>>>>>> long: if we are going to rethink all of this, we should approach it with an
>>>>>> open mind. Still, and this post is not helping with that either, I am
>>>>>> afraid that we are discussing implementation details, but are missing a
>>>>>> broader vision of what we want to accomplish and why. That vision of what
>>>>>> numpy's grouping functionality, if any, should be, and how it complements
>>>>>> or conflicts with what pandas is providing, should precede anything else. I
>>>>>> know I haven't, but has anyone looked at how pandas implements grouping?
>>>>>> Their documentation on the subject is well worth a read:
>>>>>>
>>>>>> http://pandas.pydata.org/pandas-docs/stable/groupby.html
>>>>>>
>>>>>> Does numpy need to replicate this? What/why/how can we add to that?
>>>>>>
>>>>>> Jaime
>>>>>>
>>>>>> --
>>>>>> (\__/)
>>>>>> ( O.o)
>>>>>> ( > <) This is Conejo. Copy Conejo into your signature and help him with
>>>>>> his plans for world domination.
>>>>>>
>>>>>>
>>>>> I would certainly not be opposed to having a hashing-based indexing
>>>>> mechanism; I think it would make sense design-wise to have a HashIndex
>>>>> class with the same interface as the rest, and use that subclass in those
>>>>> arraysetops where it makes sense. The 'how to' of indexing and its
>>>>> applications are largely orthogonal I think (with some tiny performance
>>>>> compromises which are worth the abstraction imo). For datasets which are
>>>>> not purely random, have many unique items, and which do not fit into cache,
>>>>> I would expect sorting to come out on top, but indeed it depends on the
>>>>> dataset.
>>>>>
>>>>> Yeah, the question how pandas does grouping, and whether we can do
>>>>> better, is a relevant one.
>>>>>
>>>>> From what I understand, pandas relies on cython extensions to get
>>>>> vectorized grouping functionality. This is no longer necessary since the
>>>>> introduction of ufuncs in numpy. I don't know how the implementations
>>>>> compare in terms of performance, but I doubt the difference is huge.
>>>>>
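>>>>> For instance, a grouped reduction can be expressed with plain ufunc
>>>>> machinery along these lines (a rough sketch, not benchmarked against
>>>>> pandas; the names are just for illustration):
>>>>>
>>>>> import numpy as np
>>>>>
>>>>> keys = np.random.randint(5, size=1000)
>>>>> vals = np.random.rand(1000)
>>>>>
>>>>> # sort once, then reduce each run of equal keys with a single ufunc call
>>>>> order = np.argsort(keys, kind='mergesort')
>>>>> sorted_keys = keys[order]
>>>>> starts = np.flatnonzero(
>>>>>     np.concatenate(([True], sorted_keys[1:] != sorted_keys[:-1])))
>>>>> unique_keys = sorted_keys[starts]
>>>>> group_sums = np.add.reduceat(vals[order], starts)   # one sum per unique key
>>>>>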
>>>>> I personally use grouping a lot in my code, and I don't like having to
>>>>> use pandas for it. Most importantly, I don't want to go around creating a
>>>>> dataframe for a single one-line hit-and-run association between keys and
>>>>> values. The permanent association of different kinds of data and their
>>>>> metadata which pandas offers is, I think, the key difference from numpy,
>>>>> which is all about manipulating plain ndarrays. Arguably, grouping
>>>>> itself is a pretty elementary manipulation of ndarrays, and adding calls to
>>>>> DataFrame or Series in between a statement that could simply be
>>>>> group_by(keys).mean(values) feels wrong to me. As does including
>>>>> pandas as a dependency just to use this small piece of
>>>>> functionality. Grouping is a more general functionality than any particular
>>>>> method of organizing your data.
>>>>>
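>>>>> To make that concrete: what I would like to simply write as
>>>>> group_by(keys).mean(values) (a hypothetical API) looks something like
>>>>> this in plain numpy today:
>>>>>
>>>>> import numpy as np
>>>>>
>>>>> keys = np.array(['a', 'b', 'a', 'c', 'b', 'a'])
>>>>> values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
>>>>>
>>>>> # per-group mean: per-group sums divided by per-group counts
>>>>> unique_keys, inverse = np.unique(keys, return_inverse=True)
>>>>> means = np.bincount(inverse, weights=values) / np.bincount(inverse)
>>>>> # unique_keys -> ['a', 'b', 'c'], means -> [3.33..., 3.5, 4.0]
>>>>>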
>>>>> In terms of features, adding transformations and filtering might be
>>>>> nice too; I hadn't thought about it, but that is because, unlike for the
>>>>> currently implemented features, the need has never arisen for me. I'm only a
>>>>> small sample size, and I don't see any fundamental objection to adding such
>>>>> functionality though. It certainly raises the question as to where to draw
>>>>> the line with pandas; but my rule of thumb is that if you can think of it
>>>>> as an elementary operation on ndarrays, then it probably belongs in numpy.
>>>>>
>>>>>
>>>> Oh I forgot to add: with an indexing mechanism based on sorting, unique
>>>> values and counts also come 'for free', not counting the O(N) cost of
>>>> actually creating those arrays. The only time an operation relying on an
>>>> index incurs another nontrivial amount of overhead is when its 'rank' or
>>>> 'inverse' property is used, which invokes another argsort. But for the vast
>>>> majority of grouping or set operations, these properties are never used.
>>>>
>>>
>>> That extra argsort is now gone from master:
>>>
>>> https://github.com/numpy/numpy/pull/5012
>>>
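>>> If I remember the change right, the inverse is now scattered into place
>>> with a fancy assignment instead of argsorting the permutation again;
>>> roughly:
>>>
>>> import numpy as np
>>>
>>> a = np.random.randint(100, size=(1000,))
>>> perm = np.argsort(a, kind='mergesort')
>>> aux = a[perm]
>>> flag = np.concatenate(([True], aux[1:] != aux[:-1]))
>>> iflag = np.cumsum(flag) - 1              # group id of each sorted element
>>> inv_idx = np.empty(a.shape, dtype=np.intp)
>>> inv_idx[perm] = iflag                    # no second argsort needed
>>> # aux[flag][inv_idx] reconstructs a, i.e. inv_idx is the return_inverse
>>>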
>>> Even with this improvement, returning any index typically makes
>>> `np.unique` run at least 2x slower:
>>>
>>> In [1]: import numpy as np
>>> In [2]: a = np.random.randint(100, size=(1000,))
>>> In [3]: %timeit np.unique(a)
>>> 10000 loops, best of 3: 37.3 us per loop
>>> In [4]: %timeit np.unique(a, return_inverse=True)
>>> 10000 loops, best of 3: 62.1 us per loop
>>> In [5]: %timeit np.unique(a, return_index=True)
>>> 10000 loops, best of 3: 72.8 us per loop
>>> In [6]: %timeit np.unique(a, return_counts=True)
>>> 10000 loops, best of 3: 56.4 us per loop
>>> In [7]: %timeit np.unique(a, return_index=True, return_inverse=True, return_counts=True)
>>> 10000 loops, best of 3: 112 us per loop
>>>
>>> --
>>> (\__/)
>>> ( O.o)
>>> ( > <) This is Conejo. Copy Conejo into your signature and help him with
>>> his plans for world domination.
>>>
>>>
>>>
>>
> Yeah, I looked at the numpy implementation, and it seems these speed
> differences are simply the result of the extra O(N) costs involved, so my
> implementation would have the same characteristics. If another array copy
> or two has a meaningful impact on performance, then you are done as far as
> optimization within numpy is concerned, I'd say. You could fuse those loops
> at the C level, but, as you said, I think it's better to think about these
> kinds of optimizations once we have a more complete picture of the
> functionality we want.
>
> Good call on removing the extra argsort, that hadn't occurred to me.
>
>
>