[Numpy-discussion] unique 2d arrays

Tue Sep 21 13:29:03 EDT 2010

On Tue, Sep 21, 2010 at 1:55 AM, Peter Schmidtke
<pschmidtke at mmb.pcb.ub.es> wrote:

> Dear all,
>
> I'd like to know if there is a pythonic / numpy way of retrieving unique
> lines of a 2d numpy array.
>
> In a way I have this :
>
> [[409 152]
>  [409 152]
>  [409 152]
>  [409 152]
>  [409 152]
>  [409 152]
>  [409 152]
>  [409 152]
>  [409 152]
>  [409 152]
>  [409 152]
>  [426 193]
>  [431 129]]
>
> And I'd like to get this :
>
> [[409 152]
>  [426 193]
>  [431 129]]
>
>
> How can I do this without workarounds like string concatenation or such
> things? Numpy.unique flattens the whole array so it's not really of use
> here.
>

Here is one alternative:

I[15]: a = np.array([[409, 152], [409, 152], [426, 193], [431, 129]])

I[16]: np.array(list(set(tuple(i) for i in a.tolist())))
O[16]:
array([[409, 152],
       [426, 193],
       [431, 129]])
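
The one-liner above can be wrapped in a small function. This is only a sketch: the name unique_rows_set is made up for illustration, and the sort is an addition of mine, since a set has no defined order and the sorted-looking output above should not be relied on.

```python
import numpy as np

def unique_rows_set(arr):
    """Return the distinct rows of a 2-D array, sorted lexicographically."""
    # Rows become hashable tuples so a set can drop duplicates.
    distinct = {tuple(row) for row in arr.tolist()}
    # A set has no defined order, so sort for a reproducible result.
    # The reshape keeps the column count even when there are no rows.
    return np.array(sorted(distinct), dtype=arr.dtype).reshape(-1, arr.shape[1])

a = np.array([[409, 152], [409, 152], [426, 193], [431, 129]])
print(unique_rows_set(a))
```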

I[6]: %timeit np.unique(a.view([('',a.dtype)]*a.shape[1])).view(a.dtype).reshape(-1,a.shape[1])
10000 loops, best of 3: 51 us per loop

I[8]: %timeit np.array(list(set(tuple(i) for i in a.tolist())))
10000 loops, best of 3: 31.4 us per loop

# Try with a bigger array
I[9]: k = np.array((a.tolist()*50000))

I[10]: %timeit np.array(list(set(tuple(i) for i in k.tolist())))
1 loops, best of 3: 324 ms per loop

I[11]: %timeit np.unique(k.view([('',k.dtype)]*k.shape[1])).view(k.dtype).reshape(-1,k.shape[1])
1 loops, best of 3: 790 ms per loop
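
For reference, the np.unique expression timed here packs a lot into one line. Spelled out step by step it might look like the sketch below; the function name unique_rows_view and the np.ascontiguousarray call are my own additions (the view needs contiguous rows), not part of the timed expression.

```python
import numpy as np

def unique_rows_view(arr):
    """Distinct rows of a 2-D array via a structured-dtype view."""
    # The view needs each row to be one contiguous block of memory.
    arr = np.ascontiguousarray(arr)
    n_cols = arr.shape[1]
    # One anonymous field per column: every row becomes a single record.
    row_dtype = [('', arr.dtype)] * n_cols
    records = arr.view(row_dtype)
    # np.unique flattens and sorts the records, dropping duplicates.
    distinct = np.unique(records)
    # Reinterpret the records as plain numbers and restore the column count.
    return distinct.view(arr.dtype).reshape(-1, n_cols)

k = np.array([[409, 152], [409, 152], [426, 193], [431, 129]] * 3)
print(unique_rows_view(k))
```

On NumPy 1.13 and later, np.unique(k, axis=0) should give the same rows without the manual view.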

On these tests the set-based version is faster than the np.unique method
(31.4 us vs. 51 us on the small array, 324 ms vs. 790 ms on the big one),
and I find it more readable. It is still not uber Pythonic, though.
Haskell has "nub" to remove duplicate list elements:
http://www.haskell.org/ghc/docs/6.12.2/html/libraries/base-4.2.0.1/Data-List.html#v%3Anub
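
A rough Python analogue of nub, which keeps the first occurrence of each element in its original order, could look like this. It is a sketch only: nub_rows is a hypothetical name, and unlike Haskell's nub (which needs only equality and is quadratic) it relies on hashing the rows as tuples.

```python
def nub_rows(rows):
    """Keep the first occurrence of each row, in original order."""
    seen = set()   # rows already emitted, stored as hashable tuples
    kept = []      # first occurrences, in the order they appeared
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            kept.append(key)
    return kept

print(nub_rows([[431, 129], [409, 152], [431, 129], [426, 193]]))
# prints [(431, 129), (409, 152), (426, 193)]
```

It accepts a list of lists or a 2-D array, and np.array(nub_rows(a)) turns the result back into an array.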

-- 
Gökhan