Efficiently removing duplicate rows from a 2-dimensional Numeric array

Matt McCredie mccredie at gmail.com
Thu Jul 19 19:56:59 EDT 2007


Could you use a set of tuples?

>>> set([(1,2),(1,3),(1,2),(2,3)])
set([(1, 2), (1, 3), (2, 3)])
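
Applied to your data, that could look roughly like the sketch below. The name remove_duplicate_rows is made up for illustration. It assumes that iterating over the array yields one row at a time and that each row converts cleanly to a tuple. It is written against plain Python sequences, so I have not timed it on a Numeric array of your size.

```python
def remove_duplicate_rows(rows):
    """Return the distinct rows of a sequence of rows as a list of tuples.

    Hypothetical helper: assumes each row is an iterable of hashable values.
    """
    # tuple(row) makes each row hashable, so the set drops repeated rows.
    unique = set(map(tuple, rows))
    # Sets are unordered, so the rows come back in arbitrary order.
    return list(unique)


rows = [[1, 2], [1, 3], [1, 2], [2, 3]]
unique_rows = remove_duplicate_rows(rows)
```

This keeps the order of the two elements within each row and does not preserve the order of the rows themselves, which you said does not matter. The result can be passed back to array() if you need an array again, although that conversion may still be the slow step.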

Matt

On 7/19/07, Alex Mont <t-alexm at windows.microsoft.com> wrote:
>
> I have a 2-dimensional Numeric array with the shape (N,2) and I want to
> remove all duplicate rows from the array. For example, if I start out with:
>
> [[1,2],
>  [1,3],
>  [1,2],
>  [2,3]]
>
>
>
> I want to end up with
>
> [[1,2],
>  [1,3],
>  [2,3]].
>
>
>
> (Order of the rows doesn't matter, although order of the two elements in
> each row does.)
>
>
>
> The problem is that I can't find any way of doing this that is efficient
> with large data sets (in the data set I am using, N > 1000000).
>
> The normal method of removing duplicates by putting the elements into a
> dictionary and then reading off the keys doesn't work directly because the
> keys – rows of Python arrays – aren't hashable.
>
> The best I have been able to do so far is:
>
>
>
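> from Numeric import array
>
> def remove_duplicates(x):
>     # Use each row, as a hashable tuple, as a dictionary key.
>     d = {}
>     for (a,b) in x:
>         d[(a,b)] = (a,b)
>     # Build the result from the dictionary's values, not from x.
>     return array(d.values())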
>
>
>
> According to the profiler the loop takes about 7 seconds and the call to
> array() 10 seconds with N=1,700,000.
>
>
>
> Is there a faster way to do this using Numeric?
>
>
>
> -Alex Mont
>
> --
> http://mail.python.org/mailman/listinfo/python-list
>