[Python-Dev] Sorting

Tim Peters tim.one@comcast.net
Mon, 22 Jul 2002 22:07:57 -0400


In an effort to save time on email (ya, right ...), I wrote up a pretty
detailed overview of the "timsort" algorithm.  It's attached.

all-will-be-revealed-ly y'rs  - tim


/*---------------------------------------------------------------------------
A stable natural mergesort with excellent performance on many flavors of
lightly disordered arrays, and as fast as samplesort on random arrays.

In a nutshell, the main routine marches over the array once, left to right,
alternately identifying the next run, and then merging it into the previous
runs.  Everything else is complication for speed, and some measure of memory
efficiency.

Runs
----
count_run() returns the # of elements in the next run.  A run is either
"ascending", which means non-decreasing:

    a0 <= a1 <= a2 <= ...

or "descending", which means strictly decreasing:

    a0 > a1 > a2 > ...

Note that a run is always at least 2 long, unless we start at the array's
last element.

The definition of descending is strict, because the main routine reverses
a descending run in-place, transforming a descending run into an ascending
run.  Reversal is done via the obvious fast "swap elements starting at each
end, and converge at the middle" method, and that can violate stability if
the slice contains any equal elements.  Using a strict definition of
descending ensures that a descending run contains distinct elements.
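
Here's a rough Python sketch of run detection and reversal, just to pin down
the definitions (the real code is C; the names and the list interface here
are mine, and only "<" is used for comparisons):

    def count_run(a, lo, hi):
        # Return (length, descending) for the run beginning at a[lo];
        # hi is one past the last index to consider.  A run is at least
        # 2 long unless lo is the array's last element.
        n = lo + 1
        if n == hi:
            return 1, False
        if a[n] < a[lo]:        # strictly decreasing => descending run
            n += 1
            while n < hi and a[n] < a[n - 1]:
                n += 1
            return n - lo, True
        else:                   # non-decreasing => ascending run
            n += 1
            while n < hi and not (a[n] < a[n - 1]):
                n += 1
            return n - lo, False

    def reverse_slice(a, lo, hi):
        # Swap from both ends toward the middle.  Safe for stability only
        # because a descending run contains no equal elements.
        hi -= 1
        while lo < hi:
            a[lo], a[hi] = a[hi], a[lo]
            lo += 1
            hi -= 1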

If an array is random, it's very unlikely we'll see long runs.  Much of the
rest of the algorithm is geared toward exploiting long runs, and that takes
a fair bit of work.  That work is a waste of time if the data is random, so
if a natural run contains fewer than MIN_MERGE_SLICE elements, the main loop
artificially boosts it to MIN_MERGE_SLICE elements, via binary insertion
sort applied to the right number of array elements following the short
natural run.  In a random array, *all* runs are likely to be MIN_MERGE_SLICE
long as a result, and merge_at() short-circuits the expensive stuff in that
case.
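
For concreteness, a Python sketch of that boosting step (MIN_MERGE_SLICE's
value and these names are placeholders; the real code does the same thing
in C):

    def binary_insertion_sort(a, lo, hi, start):
        # a[lo:start] is already sorted; grow the sorted region to a[lo:hi]
        # by binary-searching each new element's spot and shifting.  Using
        # "pivot < a[mid]" puts equal elements after existing ones (stable).
        for i in range(start, hi):
            pivot = a[i]
            left, right = lo, i
            while left < right:
                mid = (left + right) // 2
                if pivot < a[mid]:
                    right = mid
                else:
                    left = mid + 1
            a[left + 1:i + 1] = a[left:i]
            a[left] = pivot

    # The main loop would then do roughly:  if the natural run has length
    # n < MIN_MERGE_SLICE (after reversing a descending run), call
    #     binary_insertion_sort(a, lo, lo + min(MIN_MERGE_SLICE, hi - lo), lo + n)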

The Merge Pattern
-----------------
In order to exploit regularities in the data, we're merging on natural
run lengths, and they can become wildly unbalanced.  But that's a Good Thing
for this sort!

Stability constrains permissible merging patterns.  For example, if we have
3 consecutive runs of lengths

    A:10000  B:20000  C:10000

we dare not merge A with C first, because if A, B and C happen to contain
a common element, it would get out of order wrt its occurrence(s) in B.  The
merging must be done as (A+B)+C or A+(B+C) instead.

So merging is always done on two consecutive runs at a time, and in-place,
although this may require some temp memory (more on that later).

When a run is identified, its base address and length are pushed on a stack
in the MergeState struct.  merge_collapse() is then called to see whether
it should merge it with preceding run(s).  We would like to delay merging
as long as possible in order to exploit patterns that may come up later, but
we would like to do merging as soon as possible to exploit that the run just
found is still high in the memory hierarchy.  We also can't delay merging
"too long" because it consumes memory to remember the runs that are still
unmerged, and the stack has a fixed size.

What turned out to be a good compromise maintains two invariants on the
stack entries, where A, B and C are the lengths of the three rightmost
not-yet-merged slices:

1.   A > B+C
2.   B > C

Note that, by induction, #2 implies the lengths of pending runs form a
decreasing sequence.  #1 implies that, reading the lengths right to left,
the pending-run lengths grow at least as fast as the Fibonacci numbers.
Therefore the stack can never grow larger than about log_base_phi(N) entries,
where phi = (1+sqrt(5))/2 ~= 1.618.  Thus a small # of stack slots suffice
for very large arrays.
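
A quick back-of-the-envelope check of that bound (my arithmetic here, not
anything in the code):

    from math import log, sqrt

    phi = (1 + sqrt(5)) / 2
    print(log(2.0 ** 64) / log(phi))   # ~92, so ~90-odd stack slots cover
                                       # any array a 64-bit index can address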

If A <= B+C, the smaller of A and C is merged with B, and the new run replaces
the A,B or B,C entries; e.g., if the last 3 entries are

    A:30  B:20  C:10

then B is merged with C, leaving

    A:30  BC:30

on the stack.  Or if they were

    A:500  B:400  C:1000

then A is merged with B, leaving

    AB:900  C:1000

on the stack.

In both examples, the stack configuration still violates invariant #2, and
merge_collapse() goes on to continue merging runs until both invariants are
satisfied.  As an extreme case, suppose we didn't do the MIN_MERGE_SLICE
gimmick, and natural runs were of lengths 128, 64, 32, 16, 8, 4, 2, and 2.
Nothing would get merged until the final 2 was seen, and that would trigger
7 perfectly balanced (both runs involved have the same size) merges.
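
In rough Python (stack entries as (base, length) pairs; this is only a
sketch, and the details are my own), restoring the invariants looks like:

    def merge_collapse(ms):
        # ms.pending holds (base, length) pairs for the not-yet-merged runs,
        # leftmost run first.  Merge until invariants 1 (A > B+C) and
        # 2 (B > C) hold on the rightmost lengths.
        p = ms.pending
        while len(p) > 1:
            n = len(p) - 2
            if n > 0 and p[n - 1][1] <= p[n][1] + p[n + 1][1]:
                # invariant 1 broken: merge B with the smaller of A and C
                if p[n - 1][1] < p[n + 1][1]:
                    n -= 1
                merge_at(ms, n)          # merge runs n and n+1 on the stack
            elif p[n][1] <= p[n + 1][1]:
                # invariant 2 broken: merge the last two runs
                merge_at(ms, n)
            else:
                break                    # both invariants hold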

The thrust of these rules when they trigger merging is to balance the run
lengths as closely as possible, while keeping a low bound on the number
of runs we have to remember.  This is maximally effective for random data,
where all runs are likely to be of (artificially forced) length
MIN_MERGE_SLICE, and then we get a sequence of perfectly balanced merges.

OTOH, the reason this sort is so good for lightly disordered data has to do
with wildly unbalanced run lengths.

Merge Memory
------------
Merging adjacent runs of lengths A and B in-place is very difficult.
Theoretical constructions are known that can do it, but they're too difficult
and slow for practical use.  But if we have temp memory equal to min(A, B),
it's easy.

If A is smaller, copy A to a temp array, leave B alone, and then we can
do the obvious merge algorithm left to right, from the temp area and B,
starting the stores into where A used to live.  There's always a free area
in the original area comprising a number of elements equal to the number
not yet merged from the temp array (trivially true at the start; proceed
by induction).  The only tricky bit is that if a comparison raises an
exception, we have to remember to copy the remaining elements back in from
the temp area, lest the array end up with duplicate entries from B.

If B is smaller, much the same, except that we need to merge right to left,
starting the stores at the right end of where B used to live.

In all, then, we need no more than N/2 temp array slots.
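
Here's the "A is smaller" case as a Python sketch (no exception handling,
no galloping, made-up names; it just shows why min(A, B) temp slots are
enough):

    def merge_lo(a, base_a, len_a, base_b, len_b):
        # The adjacent sorted runs are a[base_a:base_a+len_a] and
        # a[base_b:base_b+len_b], with len_a <= len_b.  Copy the left run
        # to temp, then merge left to right into the space it vacated.
        assert base_a + len_a == base_b and len_a <= len_b
        temp = a[base_a:base_a + len_a]
        i = j = 0                  # indices into temp and into run B
        dest = base_a              # next free slot in the original array
        while i < len_a and j < len_b:
            if a[base_b + j] < temp[i]:    # strict "<" keeps ties stable
                a[dest] = a[base_b + j]
                j += 1
            else:
                a[dest] = temp[i]
                i += 1
            dest += 1
        # Leftovers from B are already in place; leftovers from temp slot
        # into the remaining free area.
        a[dest:dest + len_a - i] = temp[i:]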

A refinement:  When we're about to merge adjacent runs A and B, we first
do a form of binary search (more on that later) to see where B[0] should
end up in A.  Elements in A preceding that point are already in their final
positions, effectively shrinking the size of A.  Likewise we also search
to see where A[-1] should end up in B, and elements of B after that point
can also be ignored.  This cuts the amount of temp memory needed by the
same amount.  It may not pay, though.
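
Using Python's bisect module just to show the shape of that refinement (the
real thing would use the galloping search described in the next section,
and these names are mine):

    from bisect import bisect_left, bisect_right

    def trim(a, base_a, len_a, base_b, len_b):
        # Shrink adjacent sorted runs A and B before merging.  A's elements
        # before where B[0] belongs, and B's elements at or after where
        # A[-1] belongs, are already in their final positions.
        k = bisect_right(a, a[base_b], base_a, base_a + len_a)
        len_a -= k - base_a
        base_a = k
        if len_a == 0:
            return base_a, 0, base_b, 0    # nothing left to merge
        k = bisect_left(a, a[base_a + len_a - 1], base_b, base_b + len_b)
        len_b = k - base_b
        return base_a, len_a, base_b, len_b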

Merge Algorithms
----------------
When merging runs of lengths A and B, if A/2 <= B <= 2*A (i.e., they're
within a factor of two of each other), we do the usual straightforward
one-at-a-time merge.  This can take up to A+B comparisons.  If the data is
random, there's very little potential for doing better than that.  If there
are a great many equal elements, we can do better than that, but there's no
way to know whether there *are* a great many equal elements short of doing a
great many additional comparisons (we only use "<" in sort), and that's
too expensive when it doesn't pay.

If the sizes of A and B are out of whack, we can do much better.  The
Hwang-Lin merging algorithm is very good at merging runs of mismatched
lengths if the data is random, but I believe it would be a mistake to
try that here.  As explained before, if we really do have random data, we're
almost certainly going to stay in the A/2 <= B <= 2*A case.

Instead we assume that wildly different run lengths correspond to *some*
sort of clumpiness in the data.  Without loss of generality, assume A is
the shorter run.  We first look for A[0] in B.  We do this via "galloping",
comparing A[0] in turn to B[0], B[1], B[3], B[7], ..., B[2**j - 1], ...,
until finding the k such that B[2**(k-1) - 1] < A[0] <= B[2**k - 1].  This
takes at most log2(B) comparisons, and, unlike a straight binary search,
favors finding the right spot early in B.  Why that's important may <wink>
become clear later.

After finding such a k, the region of uncertainty is reduced to 2**(k-1) - 1
consecutive elements, and a straight binary search requires exactly k-1
comparisons to nail it.
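
A Python sketch of that gallop-then-binary-search (just the search, in the
"key goes before any equal elements" flavor; the name and interface are
mine):

    def gallop_left(key, b):
        # Return i such that b[:i] < key <= b[i:], probing b[0], b[1],
        # b[3], b[7], ..., b[2**k - 1] first so an answer near the left
        # end of b is found in few comparisons, then finishing off the
        # bracketed gap with a plain binary search.
        n = len(b)
        if n == 0 or key <= b[0]:
            return 0
        lo, ofs = 0, 1             # invariant: b[lo] < key
        while ofs < n and b[ofs] < key:
            lo = ofs
            ofs = 2 * ofs + 1      # offsets visited are 2**k - 1
        hi = min(ofs, n)           # now b[lo] < key, and key <= b[hi] if hi < n
        lo += 1
        while lo < hi:             # binary search over b[lo:hi]
            mid = (lo + hi) // 2
            if b[mid] < key:
                lo = mid + 1
            else:
                hi = mid
        return lo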

Now we can copy all the B's up to that point in one chunk, and then copy A[0].
If the data really is clustered, the new A[0] (what was A[1] at the start)
is likely to belong near the start of what remains of the B run.  That's
why we gallop first instead of doing a straight binary search:  if the new
A[0] really is near the start of the remaining B run, galloping will find it
much quicker.  OTOH, if we're wrong, galloping + binary search never takes
more than 2*log2(B) compares, so can't become a disaster.  If the clumpiness
comes in distinct clusters, gallop + binary search also adapts nicely to
that.

I first learned about the galloping strategy in a related context; do a
Google search to find this paper available online:

    "Adaptive Set Intersections, Unions, and Differences" (2000)
    Erik D. Demaine, Alejandro López-Ortiz, J. Ian Munro

and its followup(s).
---------------------------------------------------------------------------*/
