[SciPy-User] Removing duplicate cols/rows

Wes McKinney wesmckinn at gmail.com
Mon Dec 19 21:40:49 EST 2011


On Mon, Dec 19, 2011 at 9:38 PM,  <josef.pktd at gmail.com> wrote:
> On Mon, Dec 19, 2011 at 9:17 PM, Wes McKinney <wesmckinn at gmail.com> wrote:
>> On Mon, Dec 19, 2011 at 8:55 PM, eat <e.antero.tammi at gmail.com> wrote:
>>> Hi,
>>>
>>> On Tue, Dec 20, 2011 at 3:39 AM, Wes McKinney <wesmckinn at gmail.com> wrote:
>>>>
>>>> On Mon, Dec 19, 2011 at 8:32 PM, eat <e.antero.tammi at gmail.com> wrote:
>>>> > Hi,
>>>> >
>>>> > On Tue, Dec 20, 2011 at 2:58 AM, Wes McKinney <wesmckinn at gmail.com>
>>>> > wrote:
>>>> >>
>>>> >> On Mon, Dec 19, 2011 at 2:58 PM, Warren Weckesser
>>>> >> <warren.weckesser at enthought.com> wrote:
>>>> >> >
>>>> >> >
>>>> >> > On Mon, Dec 19, 2011 at 1:49 PM, Wes McKinney <wesmckinn at gmail.com>
>>>> >> > wrote:
>>>> >> >>
>>>> >> >> On Mon, Dec 19, 2011 at 2:41 PM, eat <e.antero.tammi at gmail.com>
>>>> >> >> wrote:
>>>> >> >> > Hi,
>>>> >> >> >
>>>> >> >> > On Mon, Dec 19, 2011 at 11:59 AM, Sergi Pons Freixes
>>>> >> >> > <sponsfreixes at gmail.com> wrote:
>>>> >> >> >>
>>>> >> >> >> Hi All,
>>>> >> >> >>
>>>> >> >> >> I'm using a 2D shape array to store pairs of
>>>> >> >> >> longitudes+latitudes. At one point, I have to merge two of
>>>> >> >> >> those 2D arrays, and then remove any duplicate entries. I've
>>>> >> >> >> been searching for a function similar to numpy.unique, but
>>>> >> >> >> I've had no luck. Any implementation I've been thinking of
>>>> >> >> >> looks very "unoptimized". Is there an existing solution, so I
>>>> >> >> >> do not reinvent the wheel?
>>>> >> >> >>
>>>> >> >> >> To make it clear, I'm looking for:
>>>> >> >> >> >>> a = np.array([[1, 1], [2, 3], [1, 1], [5, 4], [2, 3]])
>>>> >> >> >> >>> unique_rows(a)
>>>> >> >> >> array([[1, 1], [2, 3],[5, 4]])
>>>> >> >> >
>>>> >> >> > A dot product with a random vector may do the trick, like:
>>>> >> >> > In []: a
>>>> >> >> > Out[]:
>>>> >> >> > array([[1, 1],
>>>> >> >> >        [2, 3],
>>>> >> >> >        [1, 1],
>>>> >> >> >        [5, 4],
>>>> >> >> >        [2, 3]])
>>>> >> >> > In []: unique_index = np.unique(a.dot(np.random.rand(2)),
>>>> >> >> >                                 return_index=True)[1]
>>>> >> >> > In []: a[unique_index]
>>>> >> >> > Out[]:
>>>> >> >> > array([[1, 1],
>>>> >> >> >        [2, 3],
>>>> >> >> >        [5, 4]])
>>>> >> >> >
>>>> >> >> > (and for cols use just transpose of a)
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > My 2 cents,
>>>> >> >> > eat
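A runnable sketch of this random-projection idea (the helper name is made up here; np.random.default_rng is the modern RNG API, and seeding keeps it reproducible). Equal projections only imply equal rows up to floating-point collisions, so treat it as a fast heuristic rather than an exact test:

```python
import numpy as np

def unique_rows_random(a, seed=0):
    """Heuristic row de-duplication via projection onto a random vector."""
    rng = np.random.default_rng(seed)
    # Identical rows produce bit-identical projections; distinct rows
    # collide only with (near-)zero probability.
    keys = a.dot(rng.random(a.shape[1]))
    _, idx = np.unique(keys, return_index=True)
    # Sorting the first-occurrence indices preserves the original order.
    return a[np.sort(idx)]

a = np.array([[1, 1], [2, 3], [1, 1], [5, 4], [2, 3]])
print(unique_rows_random(a))
# [[1 1]
#  [2 3]
#  [5 4]]
```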
>>>> >> >> >>
>>>> >> >> >>
>>>> >> >> >> BTW, I wanted to use just a list of tuples for it, but the
>>>> >> >> >> lists were so big that they consumed my 4Gb RAM + 4Gb swap
>>>> >> >> >> (numpy arrays are more memory efficient).
>>>> >> >> >>
>>>> >> >> >> Regards,
>>>> >> >> >> Sergi
>>>> >> >> >> _______________________________________________
>>>> >> >> >> SciPy-User mailing list
>>>> >> >> >> SciPy-User at scipy.org
>>>> >> >> >> http://mail.scipy.org/mailman/listinfo/scipy-user
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >>
>>>> >> >> I implemented an efficient function for this in pandas:
>>>> >> >>
>>>> >> >> In [1]: a = np.array([[1, 1], [2, 3], [1, 1], [5, 4], [2, 3]])
>>>> >> >>
>>>> >> >> In [2]: df = DataFrame(a)
>>>> >> >>
>>>> >> >> In [3]: df
>>>> >> >> Out[3]:
>>>> >> >>   0  1
>>>> >> >> 0  1  1
>>>> >> >> 1  2  3
>>>> >> >> 2  1  1
>>>> >> >> 3  5  4
>>>> >> >> 4  2  3
>>>> >> >>
>>>> >> >> In [4]: df.drop_duplicates()
>>>> >> >> Out[4]:
>>>> >> >>   0  1
>>>> >> >> 0  1  1
>>>> >> >> 1  2  3
>>>> >> >> 3  5  4
>>>> >> >>
>>>> >> >> you can get just the ndarray back by df.drop_duplicates().values
>>>> >> >>
>>>> >> >> - Wes
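For reference, a self-contained version of the above (assuming a reasonably recent pandas; keep='last' is the current spelling of the last-occurrence option mentioned further down the thread):

```python
import numpy as np
import pandas as pd

a = np.array([[1, 1], [2, 3], [1, 1], [5, 4], [2, 3]])

# drop_duplicates keeps the first occurrence of each row by default
# and preserves the original row order.
first = pd.DataFrame(a).drop_duplicates().values
print(first)
# [[1 1]
#  [2 3]
#  [5 4]]

# keep='last' retains the last occurrence of each duplicated row instead.
last = pd.DataFrame(a).drop_duplicates(keep='last').values
print(last)
# [[1 1]
#  [5 4]
#  [2 3]]
```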
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> > Or...
>>>> >> >
>>>> >> > In [44]: x
>>>> >> > Out[44]:
>>>> >> > array([[3, 3],
>>>> >> >        [3, 2],
>>>> >> >        [2, 1],
>>>> >> >        [3, 3],
>>>> >> >        [1, 2],
>>>> >> >        [3, 1],
>>>> >> >        [1, 3],
>>>> >> >        [1, 1],
>>>> >> >        [2, 3],
>>>> >> >        [3, 2],
>>>> >> >        [1, 1],
>>>> >> >        [3, 3],
>>>> >> >        [1, 1],
>>>> >> >        [3, 2],
>>>> >> >        [3, 2]])
>>>> >> >
>>>> >> > In [45]: u = unique(x.view(dtype=dtype([('a',x.dtype),('b',x.dtype)]))).view(x.dtype).reshape(-1,2)
>>>> >> >
>>>> >> > In [46]: u
>>>> >> > Out[46]:
>>>> >> > array([[1, 1],
>>>> >> >        [1, 2],
>>>> >> >        [1, 3],
>>>> >> >        [2, 1],
>>>> >> >        [2, 3],
>>>> >> >        [3, 1],
>>>> >> >        [3, 2],
>>>> >> >        [3, 3]])
>>>> >> >
>>>> >> >
>>>> >> > The 'one-liner' above converts x to a 1D structured array with
>>>> >> > two fields, then applies numpy.unique to the 1D array, and then
>>>> >> > converts that result back to a 2D array.
>>>> >> >
>>>> >> > Warren
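The same structured-view trick, written out as a runnable sketch (np.ascontiguousarray guards the view; in NumPy 1.13+ np.unique(x, axis=0) does the same job directly):

```python
import numpy as np

x = np.array([[3, 3], [3, 2], [2, 1], [3, 3], [1, 2]])

# View each length-2 row as one structured element so np.unique can
# compare whole rows at once; the view requires contiguous memory.
dt = np.dtype([('a', x.dtype), ('b', x.dtype)])
u = np.unique(np.ascontiguousarray(x).view(dt)).view(x.dtype).reshape(-1, 2)
print(u)
# [[1 2]
#  [2 1]
#  [3 2]
#  [3 3]]
```

Note that the result comes back lexicographically sorted, not in first-occurrence order.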
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >
>>>> > Hi,
>>>> >
>>>> >> That is cool. I found it interesting that np.unique is really slow on
>>>> >> record arrays (the DataFrame method, dict-based under the hood, is
>>>> >> about 5x faster). Is it doing tuple comparison?
>>>> >
>>>> > np.unique seems to be quite slow indeed. Also the number of columns
>>>> > seems to need to be hardcoded.
>>>> >
>>>> > A slightly off-topic issue is that it doesn't even preserve the order
>>>> > of 'first occurrences' of the duplicate rows. Does your dict-based
>>>> > implementation respect this requirement?
>>>> >
>>>> >
>>>> > Regards,
>>>> > eat
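The structured-view trick can be made order-preserving with return_index (a sketch; the anonymous-field dtype is just a generic way to avoid hardcoding the column count):

```python
import numpy as np

def unique_rows_ordered(a):
    """Exact row de-duplication keeping first occurrences in order."""
    # One anonymous field per column, so nothing is hardcoded;
    # NumPy auto-names the empty fields f0, f1, ...
    b = np.ascontiguousarray(a).view([('', a.dtype)] * a.shape[1])
    _, idx = np.unique(b, return_index=True)
    # np.unique returns sorted rows; reindexing by the sorted
    # first-occurrence positions restores the original order.
    return a[np.sort(idx)]

a = np.array([[2, 3], [1, 1], [2, 3], [5, 4], [1, 1]])
print(unique_rows_ordered(a))
# [[2 3]
#  [1 1]
#  [5 4]]
```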
>>>> >>
>>>> >
>>>> >
>>>> >
>>>> >
>>>>
>>>> Yes -- it also has the option to keep the last observation instead.
>>>
>>> Very cool indeed. Does it make any (significant) difference,
>>> performance-wise, to choose either the first or last occurrence?
>>>
>>> Regards,
>>> eat
>>>>
>>>
>>>
>>>
>>>
>>
>> Nope, no significant difference. Here's the algorithm:
>>
>> https://github.com/wesm/pandas/blob/master/pandas/src/groupby.pyx#L487
>
> As far as I understand this requires hashability, so you still need to
> convert rows to tuples first, and it wouldn't work with text stored as
> object arrays.
>
> Or do I misread this?
>
> Josef
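A pure-Python sketch of the hash-based idea: convert each row to a tuple and track what has been seen. This does work for text in object arrays, as long as the individual elements are hashable:

```python
import numpy as np

def unique_rows_hashed(a):
    """Keep the first occurrence of each row, in original order."""
    seen = set()
    keep = []
    for i, row in enumerate(map(tuple, a)):
        if row not in seen:      # tuples of hashable elements hash fine
            seen.add(row)
            keep.append(i)
    return a[keep]

a = np.array([['x', 'y'], ['x', 'y'], ['z', 'y']], dtype=object)
print(unique_rows_hashed(a))
# [['x' 'y']
#  ['z' 'y']]
```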
>
>

What would you store in an object array that is not hashable?
