[SciPy-User] Removing duplicate cols/rows

Wes McKinney wesmckinn at gmail.com
Mon Dec 19 21:17:58 EST 2011


On Mon, Dec 19, 2011 at 8:55 PM, eat <e.antero.tammi at gmail.com> wrote:
> Hi,
>
> On Tue, Dec 20, 2011 at 3:39 AM, Wes McKinney <wesmckinn at gmail.com> wrote:
>>
>> On Mon, Dec 19, 2011 at 8:32 PM, eat <e.antero.tammi at gmail.com> wrote:
>> > Hi,
>> >
>> > On Tue, Dec 20, 2011 at 2:58 AM, Wes McKinney <wesmckinn at gmail.com>
>> > wrote:
>> >>
>> >> On Mon, Dec 19, 2011 at 2:58 PM, Warren Weckesser
>> >> <warren.weckesser at enthought.com> wrote:
>> >> >
>> >> >
>> >> > On Mon, Dec 19, 2011 at 1:49 PM, Wes McKinney <wesmckinn at gmail.com>
>> >> > wrote:
>> >> >>
>> >> >> On Mon, Dec 19, 2011 at 2:41 PM, eat <e.antero.tammi at gmail.com>
>> >> >> wrote:
>> >> >> > Hi,
>> >> >> >
>> >> >> > On Mon, Dec 19, 2011 at 11:59 AM, Sergi Pons Freixes
>> >> >> > <sponsfreixes at gmail.com> wrote:
>> >> >> >>
>> >> >> >> Hi All,
>> >> >> >>
>> >> >> >> I'm using a 2D array to store pairs of longitudes+latitudes. At
>> >> >> >> one point, I have to merge two of those 2D arrays and then remove
>> >> >> >> any duplicate entries. I've been searching for a function similar
>> >> >> >> to numpy.unique, but I've had no luck. Every implementation I've
>> >> >> >> thought of looks very "unoptimized". Is there an existing
>> >> >> >> solution, so I do not reinvent the wheel?
>> >> >> >>
>> >> >> >> To make it clear, I'm looking for:
>> >> >> >> >>> a = np.array([[1, 1], [2, 3], [1, 1], [5, 4], [2, 3]])
>> >> >> >> >>> unique_rows(a)
>> >> >> >> array([[1, 1], [2, 3], [5, 4]])
>> >> >> >
>> >> >> > A dot product with a random vector may do the trick, like:
>> >> >> > In []: a
>> >> >> > Out[]:
>> >> >> > array([[1, 1],
>> >> >> >        [2, 3],
>> >> >> >        [1, 1],
>> >> >> >        [5, 4],
>> >> >> >        [2, 3]])
>> >> >> > In []: unique_index = np.unique(a.dot(np.random.rand(2)), return_index=True)[1]
>> >> >> > In []: a[unique_index]
>> >> >> > Out[]:
>> >> >> > array([[1, 1],
>> >> >> >        [2, 3],
>> >> >> >        [5, 4]])
>> >> >> >
>> >> >> > (and for cols, just use the transpose of a)
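
A minimal sketch of the random-projection idea above, with two assumptions added that are not in the original post: the indices are sorted so that rows come back in first-occurrence order, and the key is computed as an elementwise product followed by a row sum rather than `a.dot(...)`. The helper name `unique_rows_by_projection` and the `seed` parameter are hypothetical. The approach is heuristic: two different rows could in principle project onto the same float, so it is best treated as a quick trick rather than an exact method.

```python
import numpy as np

def unique_rows_by_projection(a, seed=0):
    """Drop duplicate rows by reducing each row to one float key.

    Heuristic: distinct rows could, in rare cases, share a key.
    """
    a = np.asarray(a)
    rng = np.random.RandomState(seed)       # fixed seed only for repeatability
    weights = rng.rand(a.shape[1])          # one random weight per column
    keys = (a * weights).sum(axis=1)        # identical rows give identical keys
    # return_index gives the first occurrence of each distinct key
    _, first_index = np.unique(keys, return_index=True)
    # sorting the indices restores the original (first-occurrence) row order
    return a[np.sort(first_index)]

a = np.array([[1, 1], [2, 3], [1, 1], [5, 4], [2, 3]])
print(unique_rows_by_projection(a))
# [[1 1]
#  [2 3]
#  [5 4]]
```

For columns, the same function can be applied to `a.T` and the result transposed back.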
>> >> >> >
>> >> >> >
>> >> >> > My 2 cents,
>> >> >> > eat
>> >> >> >>
>> >> >> >>
>> >> >> >> BTW, I wanted to use just a list of tuples for it, but the lists
>> >> >> >> were so big that they consumed my 4 GB of RAM + 4 GB of swap (numpy
>> >> >> >> arrays are more memory efficient).
>> >> >> >>
>> >> >> >> Regards,
>> >> >> >> Sergi
>> >> >> >> _______________________________________________
>> >> >> >> SciPy-User mailing list
>> >> >> >> SciPy-User at scipy.org
>> >> >> >> http://mail.scipy.org/mailman/listinfo/scipy-user
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >>
>> >> >> I implemented an efficient function for this in pandas:
>> >> >>
>> >> >> In [1]: a = np.array([[1, 1], [2, 3], [1, 1], [5, 4], [2, 3]])
>> >> >>
>> >> >> In [2]: df = DataFrame(a)
>> >> >>
>> >> >> In [3]: df
>> >> >> Out[3]:
>> >> >>    0  1
>> >> >> 0  1  1
>> >> >> 1  2  3
>> >> >> 2  1  1
>> >> >> 3  5  4
>> >> >> 4  2  3
>> >> >>
>> >> >> In [4]: df.drop_duplicates()
>> >> >> Out[4]:
>> >> >>    0  1
>> >> >> 0  1  1
>> >> >> 1  2  3
>> >> >> 3  5  4
>> >> >>
>> >> >> You can get just the ndarray back with df.drop_duplicates().values
>> >> >>
>> >> >> - Wes
>> >> >
>> >> >
>> >> >
>> >> > Or...
>> >> >
>> >> > In [44]: x
>> >> > Out[44]:
>> >> > array([[3, 3],
>> >> >        [3, 2],
>> >> >        [2, 1],
>> >> >        [3, 3],
>> >> >        [1, 2],
>> >> >        [3, 1],
>> >> >        [1, 3],
>> >> >        [1, 1],
>> >> >        [2, 3],
>> >> >        [3, 2],
>> >> >        [1, 1],
>> >> >        [3, 3],
>> >> >        [1, 1],
>> >> >        [3, 2],
>> >> >        [3, 2]])
>> >> >
>> >> > In [45]: u = unique(x.view(dtype=dtype([('a',x.dtype),('b',x.dtype)]))).view(x.dtype).reshape(-1,2)
>> >> >
>> >> > In [46]: u
>> >> > Out[46]:
>> >> > array([[1, 1],
>> >> >        [1, 2],
>> >> >        [1, 3],
>> >> >        [2, 1],
>> >> >        [2, 3],
>> >> >        [3, 1],
>> >> >        [3, 2],
>> >> >        [3, 3]])
>> >> >
>> >> >
>> >> > The 'one-liner' above converts x to a 1D structured array with two
>> >> > fields, then applies numpy.unique to the 1D array, and then converts
>> >> > that result back to a 2D array.
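
The same three steps can be sketched for any number of columns by building the structured dtype from the array's shape instead of spelling out the fields. This is an illustrative sketch, not Warren's code: the helper name `unique_rows_sorted` and the field names `f0`, `f1`, ... are made up for the example, and the input is assumed to be a 2D array with a plain numeric dtype.

```python
import numpy as np

def unique_rows_sorted(x):
    """Return the distinct rows of a 2D array, sorted lexicographically."""
    x = np.ascontiguousarray(x)             # the view below needs contiguous rows
    n_cols = x.shape[1]
    # one field per column; the field names are arbitrary
    row_dtype = np.dtype([("f%d" % i, x.dtype) for i in range(n_cols)])
    rows = x.view(row_dtype).ravel()        # 1D structured array, one record per row
    u = np.unique(rows)                     # sorts records, drops repeats
    return u.view(x.dtype).reshape(-1, n_cols)  # back to a plain 2D array

a = np.array([[1, 1], [2, 3], [1, 1], [5, 4], [2, 3]])
print(unique_rows_sorted(a))
# [[1 1]
#  [2 3]
#  [5 4]]
```

As with the one-liner, the result is sorted rather than kept in first-occurrence order. NumPy 1.13 and later also accept `np.unique(x, axis=0)`, which gives the same sorted rows directly.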
>> >> >
>> >> > Warren
>> >> >
>> >> >
>> >> >
>> >
>> > Hi,
>> >
>> >> That is cool. I found it interesting that np.unique is really slow on
>> >> record arrays (the DataFrame method, dict-based under the hood, is
>> >> about 5x faster). Is it doing tuple comparison?
>> >
>> > np.unique seems to be quite slow indeed. Also, the number of columns
>> > seems to need to be hardcoded.
>> >
>> > A slightly off-topic issue is that it doesn't even preserve the order
>> > of 'first occurrences' of the duplicate rows. Does your dict-based
>> > implementation respect this requirement?
>> >
>> >
>> > Regards,
>> > eat
>> >>
>> >
>> >
>> >
>> >
>>
>> Yes, and it also has the option to keep the last observation instead.
>
> Very cool indeed. Does it make any (significant) difference,
> performance-wise, to choose either the first or the last occurrence?
>
> Regards,
> eat
>>
>
>
>
>

Nope, no significant difference. Here's the algorithm:

https://github.com/wesm/pandas/blob/master/pandas/src/groupby.pyx#L487
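
For readers who don't want to dig through the Cython, here is a rough pure-Python sketch of the general idea (hash each row once and flag repeats). It is not the linked pandas code; the function name `duplicated` and the `keep_last` flag are chosen only for this illustration. It also suggests why first versus last should cost about the same: both are one pass over the rows, and only the scan direction changes.

```python
def duplicated(rows, keep_last=False):
    """Return one boolean per row: True if the row repeats one already seen.

    Scans forward to keep first occurrences, backward to keep last ones.
    """
    n = len(rows)
    seen = set()                            # hash table of rows seen so far
    flags = [False] * n
    order = range(n - 1, -1, -1) if keep_last else range(n)
    for i in order:
        key = tuple(rows[i])                # rows must be hashable as tuples
        if key in seen:
            flags[i] = True                 # repeat of an already-kept row
        else:
            seen.add(key)
    return flags

rows = [[1, 1], [2, 3], [1, 1], [5, 4], [2, 3]]
first = [r for r, dup in zip(rows, duplicated(rows)) if not dup]
last = [r for r, dup in zip(rows, duplicated(rows, keep_last=True)) if not dup]
print(first)  # [[1, 1], [2, 3], [5, 4]]
print(last)   # [[1, 1], [5, 4], [2, 3]]
```

Either way the surviving rows stay in their original relative order, unlike the sort-based np.unique approach.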


