[Pythonmac-SIG] Fast way to get a list of unique values?

Larry Meyn lmeyn@mail.arc.nasa.gov
Mon, 9 Sep 2002 14:14:01 -0700


--Apple-Mail-2--388249761
Content-Transfer-Encoding: 7bit
Content-Type: text/plain;
	charset=US-ASCII;
	format=flowed

Here's a function from the Python Cookbook:

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52560

def unique(s):
     """Return a list of the elements in s, but without duplicates.

     For example, unique([1,2,3,1,2,3]) is some permutation of [1,2,3],
     unique("abcabc") some permutation of ["a", "b", "c"], and
     unique(([1, 2], [2, 3], [1, 2])) some permutation of
     [[2, 3], [1, 2]].

     For best speed, all sequence elements should be hashable.  Then
     unique() will usually work in linear time.

     If not possible, the sequence elements should enjoy a total
     ordering, and if list(s).sort() doesn't raise TypeError it's
     assumed that they do enjoy a total ordering.  Then unique() will
     usually work in O(N*log2(N)) time.

     If that's not possible either, the sequence elements must support
     equality-testing.  Then unique() will usually work in quadratic
     time.
     """

     n = len(s)
     if n == 0:
         return []

     # Try using a dict first, as that's the fastest and will usually
     # work.  If it doesn't work, it will usually fail quickly, so it
     # usually doesn't cost much to *try* it.  It requires that all the
     # sequence elements be hashable, and support equality comparison.
     u = {}
     try:
         for x in s:
             u[x] = 1
     except TypeError:
         del u  # move on to the next method
     else:
         return u.keys()

     # We can't hash all the elements.  Second fastest is to sort,
     # which brings the equal elements together; then duplicates are
     # easy to weed out in a single pass.
     # NOTE:  Python's list.sort() was designed to be efficient in the
     # presence of many duplicate elements.  This isn't true of all
     # sort functions in all languages or libraries, so this approach
     # is more effective in Python than it may be elsewhere.
     try:
         t = list(s)
         t.sort()
     except TypeError:
         del t  # move on to the next method
     else:
         assert n > 0
         last = t[0]
         lasti = i = 1
         while i < n:
             if t[i] != last:
                 t[lasti] = last = t[i]
                 lasti += 1
             i += 1
         return t[:lasti]

     # Brute force is all that's left.
     u = []
     for x in s:
         if x not in u:
             u.append(x)
     return u


On Monday, September 9, 2002, at 01:48  PM, Robb Brown wrote:

>
> I have a large array (Numeric) of integers.  I'd like to get a list of 
> all the unique values in that array.  Naturally I'd like to avoid a 
> for loop.  Is there an easy way to do this in Python?
>
> Thanks,
>
> Robb
> -- 
> ______________________________
> Robb Brown
> Seaman Family MR Research Centre
> Calgary, Alberta, Canada
>
> _______________________________________________
> Pythonmac-SIG maillist  -  Pythonmac-SIG@python.org
> http://mail.python.org/mailman/listinfo/pythonmac-sig
>
>


Larry Meyn
Aerospace Operations Modeling Office

M/S 210-10                      Phone:  (650) 604-5038
NASA Ames Research Center       Fax:    (650) 604-0222
Moffett Field, CA 94035-1000    E-mail: lmeyn@mail.arc.nasa.gov
                                 E-Fax:  (425) 944-5526 sent via e-mail

--Apple-Mail-2--388249761
Content-Transfer-Encoding: 7bit
Content-Type: text/enriched;
	charset=US-ASCII

Here's a function from the Python Cookbook:


http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52560


def unique(s):

    """Return a list of the elements in s, but without duplicates.


    For example, unique([1,2,3,1,2,3]) is some permutation of [1,2,3],

    unique("abcabc") some permutation of ["a", "b", "c"], and

    unique(([1, 2], [2, 3], [1, 2])) some permutation of

    [[2, 3], [1, 2]].


    For best speed, all sequence elements should be hashable.  Then

    unique() will usually work in linear time.


    If not possible, the sequence elements should enjoy a total

    ordering, and if list(s).sort() doesn't raise TypeError it's

    assumed that they do enjoy a total ordering.  Then unique() will

    usually work in O(N*log2(N)) time.


    If that's not possible either, the sequence elements must support

    equality-testing.  Then unique() will usually work in quadratic

    time.

    """


    n = len(s)

    if n == 0:

        return []


    # Try using a dict first, as that's the fastest and will usually

    # work.  If it doesn't work, it will usually fail quickly, so it

    # usually doesn't cost much to *try* it.  It requires that all the

    # sequence elements be hashable, and support equality comparison.

    u = {}

    try:

        for x in s:

            u[x] = 1

    except TypeError:

        del u  # move on to the next method

    else:

        return u.keys()


    # We can't hash all the elements.  Second fastest is to sort,

    # which brings the equal elements together; then duplicates are

    # easy to weed out in a single pass.

    # NOTE:  Python's list.sort() was designed to be efficient in the

    # presence of many duplicate elements.  This isn't true of all

    # sort functions in all languages or libraries, so this approach

    # is more effective in Python than it may be elsewhere.

    try:

        t = list(s)

        t.sort()

    except TypeError:

        del t  # move on to the next method

    else:

        assert n > 0

        last = t[0]

        lasti = i = 1

        while i << n:

            if t[i] != last:

                t[lasti] = last = t[i]

                lasti += 1

            i += 1

        return t[:lasti]


    # Brute force is all that's left.

    u = []

    for x in s:

        if x not in u:

            u.append(x)

    return u



On Monday, September 9, 2002, at 01:48  PM, Robb Brown wrote:


<excerpt>

I have a large array (Numeric) of integers.  I'd like to get a list of
all the unique values in that array.  Naturally I'd like to avoid a
for loop.  Is there an easy way to do this in Python?


Thanks,


Robb

-- 

______________________________

Robb Brown

Seaman Family MR Research Centre

Calgary, Alberta, Canada


_______________________________________________

Pythonmac-SIG maillist  -  Pythonmac-SIG@python.org

http://mail.python.org/mailman/listinfo/pythonmac-sig



</excerpt>


<fontfamily><param>Courier</param>Larry Meyn

Aerospace Operations Modeling Office


M/S 210-10                      Phone:  (650) 604-5038

NASA Ames Research Center       Fax:    (650) 604-0222

Moffett Field, CA 94035-1000    E-mail: lmeyn@mail.arc.nasa.gov

                                E-Fax:  (425) 944-5526 sent via e-mail

</fontfamily>
--Apple-Mail-2--388249761--