Merging ordered lists
Peter Otten
__peter__ at web.de
Sun Jun 1 04:34:49 EDT 2008
etal wrote:
> Here's an algorithm question: How should I efficiently merge a
> collection of mostly similar lists, with different lengths and
> arbitrary contents, while eliminating duplicates and preserving order
> as much as possible?
>
> My code:
>
> def merge_to_unique(sources):
> """Merge the unique elements from each list in sources into new
> list.
>
> Using the longest input list as a reference, merges in the
> elements from
> each of the smaller or equal-length lists, and removes duplicates.
>
> @return: Combined list of elements.
> """
> sources.sort(None, len, True) # Descending length
> ref = sources[0]
> for src in sources[1:]:
> for i, s in enumerate(src):
> if s and (ref[i] != s) and s not in ref:
> ref.insert(ref.index(src[i-1])+1, s)
> # Remove duplicates
> return [r for i, r in enumerate(ref) if r and r not in ref[i+1:]]
>
>
> This comes up with using the CSV module's DictWriter class to merge a
> set (list, here) of not-quite-perfect CSV sources. The DictWriter
> constructor needs a list of field names so that it can convert
> dictionaries into rows of the CSV file it writes. Some of the input
> CSV files are missing columns, some might have extras -- all of this
> should be accepted, and the order of the columns in the merged file
> should match the order of the input files as much as possible (not
> alphabetical). All of the list elements are strings, in this case, but
> it would be nice if the function didn't require it.
>
> Speed actually isn't a problem yet; it might matter some day, but for
> now it's just an issue of conceptual aesthetics. Any suggestions?
#untested
import difflib
def _merge(a, b):
sm = difflib.SequenceMatcher(None, a, b)
for op, a1, a2, b1, b2 in sm.get_opcodes():
if op == "insert":
yield b[b1:b2]
else:
yield a[a1:a2]
def merge(a, b):
return sum(_merge(a, b), [])
def merge_to_unique(sources):
return reduce(merge, sorted(sources, key=len, reverse=True))
Peter
More information about the Python-list
mailing list