[SciPy-User] Removing duplicate cols/rows

josef.pktd at gmail.com
Mon Dec 19 20:39:10 EST 2011


On Mon, Dec 19, 2011 at 8:32 PM, eat <e.antero.tammi at gmail.com> wrote:
> Hi,
>
> On Tue, Dec 20, 2011 at 2:58 AM, Wes McKinney <wesmckinn at gmail.com> wrote:
>>
>> On Mon, Dec 19, 2011 at 2:58 PM, Warren Weckesser
>> <warren.weckesser at enthought.com> wrote:
>> >
>> >
>> > On Mon, Dec 19, 2011 at 1:49 PM, Wes McKinney <wesmckinn at gmail.com>
>> > wrote:
>> >>
>> >> On Mon, Dec 19, 2011 at 2:41 PM, eat <e.antero.tammi at gmail.com> wrote:
>> >> > Hi,
>> >> >
>> >> > On Mon, Dec 19, 2011 at 11:59 AM, Sergi Pons Freixes
>> >> > <sponsfreixes at gmail.com> wrote:
>> >> >>
>> >> >> Hi All,
>> >> >>
>> >> >> I'm using a 2D array to store pairs of longitude+latitude. At
>> >> >> one point, I have to merge two of those 2D arrays and then remove
>> >> >> any duplicate entries. I've been searching for a function similar
>> >> >> to numpy.unique, but I've had no luck. Any implementation I've
>> >> >> thought of looks very "unoptimized". Is there an existing
>> >> >> solution, so I do not reinvent the wheel?
>> >> >>
>> >> >> To make it clear, I'm looking for:
>> >> >> >>> a = np.array([[1, 1], [2, 3], [1, 1], [5, 4], [2, 3]])
>> >> >> >>> unique_rows(a)
>> >> >> array([[1, 1], [2, 3], [5, 4]])
>> >> >
>> >> > A dot product with a random vector may do the trick, like:
>> >> > In []: a
>> >> > Out[]:
>> >> > array([[1, 1],
>> >> >        [2, 3],
>> >> >        [1, 1],
>> >> >        [5, 4],
>> >> >        [2, 3]])
>> >> > In []: unique_index = np.unique(a.dot(np.random.rand(2)),
>> >> >                                 return_index=True)[1]
>> >> > In []: a[unique_index]
>> >> > Out[]:
>> >> > array([[1, 1],
>> >> >        [2, 3],
>> >> >        [5, 4]])
>> >> >
>> >> > (and for columns just use the transpose of a)
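>> >> >
>> >> > For instance, this untested sketch (unique_cols is just an
>> >> > illustrative name; note the projection is floating point, so two
>> >> > distinct columns mapping to the same value is possible in
>> >> > principle, though very unlikely):
>> >> >
>> >> > def unique_cols(a):
>> >> >     # illustrative helper; assumes numpy has been imported as np
>> >> >     # project each column onto a random vector; np.unique then
>> >> >     # returns one index per distinct projection value
>> >> >     r = np.random.rand(a.shape[0])
>> >> >     unique_index = np.unique(a.T.dot(r), return_index=True)[1]
>> >> >     return a[:, unique_index]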
>> >> >
>> >> >
>> >> > My 2 cents,
>> >> > eat
>> >> >>
>> >> >>
>> >> >> BTW, I wanted to use just a list of tuples for it, but the lists
>> >> >> were so big that they consumed my 4 GB of RAM + 4 GB of swap
>> >> >> (numpy arrays are more memory efficient).
>> >> >>
>> >> >> Regards,
>> >> >> Sergi
>> >>
>> >> I implemented an efficient function for this in pandas:
>> >>
>> >> In [1]: a = np.array([[1, 1], [2, 3], [1, 1], [5, 4], [2, 3]])
>> >>
>> >> In [2]: df = DataFrame(a)
>> >>
>> >> In [3]: df
>> >> Out[3]:
>> >>   0  1
>> >> 0  1  1
>> >> 1  2  3
>> >> 2  1  1
>> >> 3  5  4
>> >> 4  2  3
>> >>
>> >> In [4]: df.drop_duplicates()
>> >> Out[4]:
>> >>   0  1
>> >> 0  1  1
>> >> 1  2  3
>> >> 3  5  4
>> >>
>> >> You can get just the ndarray back with df.drop_duplicates().values.
>> >>
>> >> - Wes
>> >
>> >
>> >
>> > Or...
>> >
>> > In [44]: x
>> > Out[44]:
>> > array([[3, 3],
>> >        [3, 2],
>> >        [2, 1],
>> >        [3, 3],
>> >        [1, 2],
>> >        [3, 1],
>> >        [1, 3],
>> >        [1, 1],
>> >        [2, 3],
>> >        [3, 2],
>> >        [1, 1],
>> >        [3, 3],
>> >        [1, 1],
>> >        [3, 2],
>> >        [3, 2]])
>> >
>> > In [45]: u = unique(x.view(dtype=dtype([('a', x.dtype), ('b', x.dtype)]))).view(x.dtype).reshape(-1, 2)
>> >
>> > In [46]: u
>> > Out[46]:
>> > array([[1, 1],
>> >        [1, 2],
>> >        [1, 3],
>> >        [2, 1],
>> >        [2, 3],
>> >        [3, 1],
>> >        [3, 2],
>> >        [3, 3]])
>> >
>> >
>> > The 'one-liner' above converts x to a 1D structured array with two
>> > fields,
>> > then applies numpy.unique to the 1D array, and then converts that result
>> > back to a 2D array.
>> >
>> > Warren
>> >
>> >
>
> Hi,
>
>> That is cool. I found it interesting that np.unique is really slow on
>> record arrays (the DataFrame method, dict-based under the hood, is
>> about 5x faster). Is it doing tuple comparison?
>
> np.unique seems to be quite slow indeed. Also, the number of columns
> seems to need to be hardcoded.

It doesn't need to be hardcoded: since an array is homogeneous, we
can just use [('', a.dtype)] * a.shape[1] or something like this.
(That was one of my first experiments with structured dtypes.)
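
An untested sketch along those lines (unique_rows is just an
illustrative name):

import numpy as np

def unique_rows(a):
    # view each row as one structured element so np.unique compares
    # whole rows; the empty field names become f0, f1, ... automatically
    a = np.ascontiguousarray(a)
    u = np.unique(a.view([('', a.dtype)] * a.shape[1]))
    return u.view(a.dtype).reshape(-1, a.shape[1])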

Josef

>
> A slightly off-topic issue is that it doesn't even preserve the order
> of 'first occurrences' of the duplicate rows. Does your dict-based
> implementation respect this requirement?
>
>
> Regards,
> eat
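
With np.unique, first-occurrence order can be recovered by sorting the
return_index result before indexing, i.e. a[np.sort(idx)]. A dict or
set keeps first occurrences by construction; an untested pure-Python
sketch of that idea (unique_rows_ordered is just an illustrative name):

import numpy as np

def unique_rows_ordered(a):
    # keep row indices in first-occurrence order via a set of row tuples
    seen = set()
    keep = []
    for i, row in enumerate(a):
        t = tuple(row)
        if t not in seen:
            seen.add(t)
            keep.append(i)
    return a[keep]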


