remove duplicates from list *preserving order*

Alex Martelli aleaxit at yahoo.com
Mon Feb 7 04:34:10 EST 2005


Steven Bethard <steven.bethard at gmail.com> wrote:
   ...
> I have a list[1] of objects from which I need to remove duplicates.  I
> have to maintain the list order though, so solutions like set(lst), etc.
> will not work for me.  What are my options?  So far, I can see:

I think the recipe by that subject in the 1st Edition Python Cookbook is
pretty exhaustive with regard to the possibilities that were available
up to Python 2.2, under various hypotheses: are the items hashable, do
you want to keep the first of several duplicates, the last one, or one
chosen by arbitrary criteria, etc, etc.  In 2.3/2.4, the addition of
`set' simplifies many solutions, and itertools simplifies others;
besides, it appears that you don't care which duplicate is kept, you
need equality only (not a separate equivalence function), and you do
have hashable items, so most of the ideas in that recipe aren't needed
anyway.
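
For instance, a couple of the recipe's variants shrink to a few lines
with 2.4's `set' builtin, `reversed', and itertools (sketches only,
assuming hashable items; the names filterdups_keeplast and
filterdups_adjacent are mine, not the recipe's):

from itertools import groupby

def filterdups_keeplast(iterable):
    # keep the LAST of each group of duplicates: dedupe the
    # reversed sequence, then restore the original order
    seen = set()
    result = []
    for item in reversed(list(iterable)):
        if item not in seen:
            seen.add(item)
            result.append(item)
    result.reverse()
    return result

def filterdups_adjacent(iterable):
    # groupby collapses runs of adjacent equal items -- enough
    # to dedupe when duplicates are known to come together
    for key, group in groupby(iterable):
        yield key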


> def filterdups(iterable):
>      seen = set()
>      for item in iterable:
>          if item not in seen:
>              seen.add(item)
>              yield item
> 
> Does anyone have a better[2] solution?
> 
> STeve
> 
> [1] Well, actually it's an iterable of objects, but I can convert it to
> a list if that's helpful.
> 
> [2] Yes I know, "better" is ambiguous.  If it helps any, for my 
> particular situation, speed is probably more important than memory, so
> I'm leaning towards the second or third implementation.

I doubt you can do _substantially_ better than filterdups, unless some
particularly special conditions apply, e.g., the items are hashable but
hashing them is very time-consuming.  In that case you might avoid
hashing items twice, as filterdups does, by some hack such as:

def filterdups_hackish(iterable):
    seen = set()
    olen = 0
    for item in iterable:
        # add unconditionally: the item gets hashed just once
        seen.add(item)
        nlen = len(seen)
        # the set grew, so the item was new (len is O(1), cheap)
        if nlen > olen:
            yield item
            olen = nlen

Here's a toy example of how this might help, given costly hashing:

class examp(object):
    # consecutive x values pair up to the same i: 50% duplicates
    def __init__(self, i): self.i = i//2
    # deliberately costly hashing, to make its overhead visible
    def __hash__(self): return self.i**7 % 999999
    # without __eq__, distinct instances never compare equal
    # and nothing would get filtered at all
    def __eq__(self, other): return self.i == other.i

dups = [examp(x) for x in range(1000)]

# timing harness -- save all of the above as fid.py
def fd1():
    for x in filterdups(dups): pass

def fd2():
    for x in filterdups_hackish(dups): pass


kallisti:~/cb alex$ python -mtimeit -s'import fid' 'fid.fd1()'
10 loops, best of 3: 55.6 msec per loop
kallisti:~/cb alex$ python -mtimeit -s'import fid' 'fid.fd2()'
10 loops, best of 3: 33.4 msec per loop


Probably not worth the bother even for this level of hashing costs and
density of duplicates, as you see, but maybe an idea worth keeping in
mind if the duplicates are few and hashing is really costly: having
both the membership test AND the insertion means you hash the item
twice; the hackish alternative is to insert anyway (hashing just once)
and check whether that changed the set's length (taking the len is
always cheap).
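
For instance, with plain ints at the interactive prompt, the hackish
version gives the same order-preserving result as filterdups:

>>> list(filterdups_hackish([1, 2, 1, 3, 2]))
[1, 2, 3]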


Alex


