merge & de-duplicate lists

Alex Martelli aleax at aleax.it
Wed Oct 8 05:30:00 EDT 2003


Alan Little wrote:

> I need to merge and de-duplicate some lists, and I have some code
> which works but doesn't seem particularly elegant. I was wondering if
> somebody could point me at a cleaner way to do it.
> 
> Here's my function:
> 
> +++++++++++++++++++
> 
> from sets import Set
> 
> def mergeLists (intialList, sourceOfAdditionalLists,
> nameOfMethodToCall) :
>     workingSet = Set(initialList)
>     for s in sourceOfAdditionalLists :
>         getList = s.__getAttribute__(nameOfMethodToCall)

Normal expression of this line would rather be:
          getList = getattr(s, nameOfMethodToCall)

>         workingSet = workingSet.union(Set \
>             (callable(getList) and getList() or getList))
>     return workingSet
> 
> ++++++++++++++
> 
> Two questions - passing the *name* of the method to call, and then
> looking it up for each object in the list of extra sources (all of
> which need to be new-style objects - not a problem in my application)
> seems inelegant. My "sourcesOfAdditionalLists" are normally all of the
> same class - is there something I can bind at class level that
> automagically refers to instance-level attributes when invoked?

I'm not quite sure what you mean here.  You can of course play
around with descriptors, e.g. properties or your own custom ones.
But that 'normally' is worrisome -- what happens in the not-so-
normal cases where one or two items are of a different class...?


> Second (hopefully clearer & less obscure) question : is
> sets.Set.union() an efficient way to do list de-duplication? Seems
> like the obvious tool for the job.

Well, .union must make a new set each time and then you throw
away the old one; this is inefficient in much the same way in
which concatenating a list of lists would be if you coded that:
    result = baselist
    for otherlist in otherlists:
        result = result + otherlist
here, too, the + would each time make a new list and you would
throw away the old one with the assignment.  This inefficiency
is removed by in-place updates, e.g. result.extend(otherlist)
for the case of list concatenation, and
    workingSet.update(otherlist)
for your case (don't bother explicitly making a Set out of
the otherlist -- update can take any iterable).

Overall, I would code your function (including some
renamings for clarity, and the removal of a very error
prone, obscure and useless and/or -- just imagine what
would happen if getList() returned an EMPTY list...) as
follows:

def mergeLists(intialList, sourceOfAdditionalLists,
                nameOfAttribute):
    workingSet = Set(initialList)
    for s in sourceOfAdditionalLists :
        getList = getattr(s, nameOfAttribute)
        if callable(getList): getList=getList()
        workingSet.update(getList)
    return workingSet


Alex





More information about the Python-list mailing list