sorting many arrays from one...

Alex Martelli aleax at aleax.it
Mon Jul 15 04:28:56 EDT 2002


On Monday 15 July 2002 09:47 am, Tim Peters wrote:
> [Alex Martelli, on list.sort() special cases]
>
> > Right.  The only extra case I've often wished it handled (defining
> > "often" as "more than once":-) is when the FIRST element is out of
> > order and all the rest are OK.  That happens in mergesorting huge
> > streams,
>
> Is the number of streams also large?  Just curious.

That strictly depends on the ratio between the size of the "huge" stream, and 
the amount of physical memory that you can devote to sorting it.  It's not
easy to pin that down, since it may depend on what else you might like to
be doing with the physical memory actually available on your machine!  If
you're sorting a stream of 4 GB, and would like to avoid taking up more than
roughly 16MB to prepare each sorted sub-stream, that's 256 streams, for
example.

Of course, you could cut the number of streams being merged at one time down 
to any arbitrary number >= 2, by doing a multipass merge.  The "output" stream
at any given time is rewound and becomes an input stream for the next pass; a
classic and rather elegant solution.  However, single-pass IS substantially
faster and simpler, if you can afford it.
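A multipass merge along those lines can be sketched as follows -- a minimal, hypothetical illustration (the name `multipass_merge` and the use of in-memory lists standing in for tape/disk runs are my assumptions; `heapq.merge` does the actual k-way merging):

```python
import heapq

def multipass_merge(runs, fanin=2):
    """Merge sorted runs at most `fanin` at a time, repeatedly, until
    a single sorted run remains.  Each pass's outputs become the next
    pass's inputs, like rewinding the output stream between passes."""
    assert fanin >= 2  # fewer than 2-way merging would never terminate
    runs = [list(r) for r in runs]
    while len(runs) > 1:
        merged = []
        for i in range(0, len(runs), fanin):
            # merge one group of up to `fanin` runs into one output run
            merged.append(list(heapq.merge(*runs[i:i + fanin])))
        runs = merged
    return runs[0] if runs else []
```

In real external sorting the runs would live on disk, of course; the lists here just model the stream contents.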

In real life, it's been quite a while since I had a _substantial_ number of
streams being merged in a straight mergesort.  However, the same merge logic
comes in handy when the pre-sorted streams are generated otherwise, not
by splitting-and-sorting an initial huge stream as in real mergesort.


> > in the core merge loop:
> >
> >     while streams:
> >         yield streams[0][0]
> >         try:
> >             streams[0][0] = streams[0][2]()
> >         except StopIteration:
> >             del streams[0]
> >         else:
> >             streams.sort()  # this is the first-maybe-OOO sort
>
> Understood, and that's reasonable.  I expect the best a general sort method
> can do with this is worst-case N+log2(N) comparisons, give or take 1 or 2,
> where N=len(streams), the bulk of that because it has to establish that
> only one is out of order (it takes N-1 comparisons just to establish that
> an already-sorted list is already sorted).  I have in mind a simple way to
> achieve that, so hold your breath <wink>.

Yes, I see your point.  I *do* know that streams[1:] IS sorted, while
the sort method needs to re-establish this fact each and every time.
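For reference, the sort-based loop above, wrapped into a self-contained generator -- a sketch in modern Python (iterator protocol via `next()` rather than 2.2's `.next()` method; the wrapper name `merge_sorted` is my invention):

```python
def merge_sorted(iterables):
    """K-way merge of already-sorted iterables, keeping the `streams`
    list itself sorted so streams[0] always holds the smallest value."""
    streams = []
    for i, it in enumerate(iterables):
        it = iter(it)
        try:
            # [value, i, fetch]: i breaks ties, so the bound methods
            # in the third slot are never actually compared
            streams.append([next(it), i, it.__next__])
        except StopIteration:
            pass  # stream was empty to begin with
    streams.sort()
    while streams:
        yield streams[0][0]
        try:
            streams[0][0] = streams[0][2]()
        except StopIteration:
            del streams[0]           # this stream is exhausted
        else:
            streams.sort()           # only streams[0] may be out of order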

> By arranging the streams in a min-heap yourself, you can guarantee
> worst-case log2(N) comparisons.  If the number of streams can be large,
> that's a monster-big improvement; it's already a factor-of-3 improvement
> over N+log2(N) for N as small as 7.

Yes, but the "del streams[0]" ensures I can't do better than O(N) anyway --
for the *general* case of a leg of this loop.  However, it's true that the del
happens very rarely -- only when a stream finishes, and these streams
tend to be quite large.  Hmmm... yes, it's probably well worth using a heap.
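The min-heap variant Tim suggests can be sketched with the standard heapq module (which, as of this writing, is new in 2.3) -- again a hypothetical wrapper name, and essentially what heapq.merge would later do for you:

```python
import heapq

def merge_heap(iterables):
    """K-way merge using a min-heap: log2(N) comparisons per element
    instead of the N+log2(N) a general sort must spend."""
    heap = []
    for i, it in enumerate(iterables):
        it = iter(it)
        try:
            # (value, i, it): unique i breaks ties, so the raw
            # iterators in the third slot are never compared
            heap.append((next(it), i, it))
        except StopIteration:
            pass
    heapq.heapify(heap)
    while heap:
        value, i, it = heap[0]
        yield value
        try:
            # replace the root with the stream's next value: O(log N)
            heapq.heapreplace(heap, (next(it), i, it))
        except StopIteration:
            heapq.heappop(heap)  # stream i is exhausted: O(log N) too
```

Note the `del streams[0]` cost worry goes away: removing an exhausted stream is also O(log N) on a heap, not O(N).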


> > The "streams" auxiliary list is originally built from the sequence X
> > of already-sorted streams as:
> >
> > streams = [ [X[i].next(), i, X[i].next] for i in range(len(X)) ]
>
> Ooh!  In 2.3 you'll be able to say
>
>     streams = [[s.next(), i, s.next] for (i, s) in enumerate(X)]
>
> It grows on you.

As I helped Raymond implement the first cut of enumerate (even though
I hear that Guido later had to half-redo it:-), I'm reasonably familiar with
it -- and it's one of the things I'd love to backport to 2.2 one day:-).

But of course I'd never use the redundant parentheses in the for clause
of the list comprehension!-)


> > Normally N==len(streams) isn't all that big, and I COULD make each
> > leg of the "while streams:" loop O(N) anyway by using knowledge
> > about _which_ special cases method sort is so good at -- changing
> > the else clause into:
> >
> >         else:
> >             streams.append(streams.pop(0))
> >             streams.sort() # the maybe-OOO item is now *LAST*...
> >
> > but that's not all that pleasant -- I'd do it only in emergencies...
>
> That would indeed get you out with N+log2(N) comparisons, N of which you
> know aren't needed, but which list.sort() can't know aren't needed.  If it
> is an emergency someday <wink>, I'd try
>
>         else:
>             s = streams.pop(0)
>             bisect.insort(streams, s)
>
> instead.  The smallest N for which that pays depends on how expensive
> element comparisons are, but the crossover point is generally small.  The

Yep.  A "maintain-heap" primitive might even be preferable, if we had it...

> main loop can also be recoded naturally with less fiddly chained indexing
> then:
>
>      insert = bisect.insort
>      while streams:
>          value, i, fetch = streams.pop(0)
>          yield value
>          try:
>              value = fetch()
>          except StopIteration:
>              pass
>          else:
>              insert(streams, [value, i, fetch])
>
> This kind of thing goes faster if streams is made a list of 3-tuples
> instead of 3-lists, if you're concerned about micro-optimization here.
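Tim's insort-based loop, filled out into a self-contained generator using 3-tuples as he suggests -- a sketch in modern Python, with my hypothetical name `merge_insort`:

```python
import bisect

def merge_insort(iterables):
    """K-way merge keeping `streams` sorted via binary insertion:
    log2(N) comparisons to re-insert, versus sort()'s N+log2(N)."""
    streams = []
    for i, it in enumerate(iterables):
        it = iter(it)
        try:
            streams.append((next(it), i, it))  # i breaks ties
        except StopIteration:
            pass
    streams.sort()
    while streams:
        value, i, it = streams.pop(0)
        yield value
        try:
            value = next(it)
        except StopIteration:
            pass                     # stream exhausted; nothing to re-insert
        else:
            bisect.insort(streams, (value, i, it))
```

The `pop(0)` is still O(N), so the heap approach remains asymptotically better for large N; bisect mainly saves comparisons, which matters when element comparisons are expensive.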

I wonder (in the micro-optimization hypothesis).  All of these tuple
packings and unpackings... one thing I've tried is still using indexing,
via lists (so I can rebind the value), but via symbols:

VALUE, I, FETCH = range(3)

while streams:
    triple = streams.pop(0)
    yield triple[VALUE]
    try:
        triple[VALUE] = triple[FETCH]()

&c.  Oh well, guess I should measure rather than wondering...:-).
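Such a measurement might look like this -- a hypothetical timeit micro-benchmark comparing tuple unpacking against indexing with symbolic names (the statements and counts are my assumptions, and the verdict will vary by machine and interpreter):

```python
import timeit

# shared setup: a 3-tuple, a 3-list, and symbolic index names
setup = "t = (0, 1, len); l = [0, 1, len]; VALUE, I, FETCH = range(3)"

# unpack all three slots of the tuple at once
t_tuple = timeit.timeit("value, i, fetch = t", setup=setup, number=200_000)

# fetch the same three slots from the list by symbolic index
t_index = timeit.timeit("value = l[VALUE]; i = l[I]; fetch = l[FETCH]",
                        setup=setup, number=200_000)

print("unpack %.4fs  index %.4fs" % (t_tuple, t_index))
```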


Alex




