Merging ordered lists

Peter Otten __peter__ at web.de
Sun Jun 1 04:34:49 EDT 2008


etal wrote:

> Here's an algorithm question: How should I efficiently merge a
> collection of mostly similar lists, with different lengths and
> arbitrary contents, while eliminating duplicates and preserving order
> as much as possible?
> 
> My code:
> 
> def merge_to_unique(sources):
>     """Merge the unique elements from each list in sources into new
> list.
> 
>     Using the longest input list as a reference, merges in the
> elements from
>     each of the smaller or equal-length lists, and removes duplicates.
> 
>     @return: Combined list of elements.
>     """
>     sources.sort(None, len, True)    # Descending length
>     ref = sources[0]
>     for src in sources[1:]:
>         for i, s in enumerate(src):
>             if s and (ref[i] != s) and s not in ref:
>                 ref.insert(ref.index(src[i-1])+1, s)
>     # Remove duplicates
>     return [r for i, r in enumerate(ref) if r and r not in ref[i+1:]]
> 
> 
> This comes up with using the CSV module's DictWriter class to merge a
> set (list, here) of not-quite-perfect CSV sources. The DictWriter
> constructor needs a list of field names so that it can convert
> dictionaries into rows of the CSV file it writes. Some of the input
> CSV files are missing columns, some might have extras -- all of this
> should be accepted, and the order of the columns in the merged file
> should match the order of the input files as much as possible (not
> alphabetical). All of the list elements are strings, in this case, but
> it would be nice if the function didn't require it.
> 
> Speed actually isn't a problem yet; it might matter some day, but for
> now it's just an issue of conceptual aesthetics. Any suggestions?

#untested
import difflib

def _merge(a, b):
    sm = difflib.SequenceMatcher(None, a, b)
    for op, a1, a2, b1, b2 in sm.get_opcodes():
        if op == "insert":
            yield b[b1:b2]
        else:
            yield a[a1:a2]

def merge(a, b):
    return sum(_merge(a, b), [])

def merge_to_unique(sources):
    return reduce(merge, sorted(sources, key=len, reverse=True))

Peter



More information about the Python-list mailing list