Merging multiple sorted sequences.

Cameron Simpson cs at zip.com.au
Wed Apr 12 18:52:13 EDT 2017


On 12Apr2017 23:15, Erik <python at lucidity.plus.com> wrote:
>I need to be able to lazily merge a variable number of already-sorted(*) 
>variable-length sequences into a single sorted sequence. The merge should 
>continue until the longest sequence has been exhausted.
>
>(*) They may in practice be a lazy source of data known to only ever 
>be generated in an order that's "sorted" from the POV of this 
>function.
>
>I have the following - can anyone think of any improvements (other 
>than indentation style on the try/excepts ;) )? This seems like the 
>sort of thing for which there might be a built-in or recipe that I've 
>missed.

It can be improved. The big cost in yours is notionally the sort, which you do 
on every yield.

For comparison, here's mine:

  https://bitbucket.org/cameron_simpson/css/src/tip/lib/python/cs/seq.py?fileviewer=file-view-default#seq.py-131

It keeps a heap (see the "heapq" standard module) containing tuples of the 
latest item from each source and the source iterable i.e. (item, iterable).

That way the lowest item is always first on the heap, so you pop it off, which 
gets you (item, iterable). Yield the item. Fetch the next item from iterable.  
If that gets StopIteration, that is exhausted. Otherwise insert the new item 
and the iterable back into the heap.

Repeat until the heap is empty.

In this way you discard exhausted iterables and never need to re-examine them, 
and the sorting overhead is just heap insertion which is O(log n) for the heap 
size. Your method is O(n log n) for each item (sorting the collection of head 
items).

The heap is faster because it makes a much weaker ordering guarentee, and for 
us the feature is that the front of the heap is the lowest item; the rest of 
the heap is partially ordered, not fully ordered. Insert and pop preserve that 
criterion.

Cheers,
Cameron Simpson <cs at zip.com.au>



More information about the Python-list mailing list