[Python-Dev] Sorting

Tim Peters tim.one@comcast.net
Mon, 22 Jul 2002 13:28:11 -0400


[Barry Warsaw]
> Except that in
>
>     http://mail.python.org/pipermail/python-dev/2002-July/026837.html
>
> Tim says:
>
>     "Back on Earth, among Python users the most frequent complaint
>      I've heard is that list.sort() isn't stable."

Yes, and because the current samplesort falls back to a stable sort when
lists are small, almost everyone who cares about this and tries to deduce
stability by experimenting with small examples comes to the wrong conclusion.
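
A quick sketch of why small experiments mislead: this checks stability by
sorting (key, original position) pairs on the key alone and verifying that
equal keys keep their original relative order.  (It uses the key= argument,
which is a later convenience; in today's Python list.sort() is stable, so
the check passes at any size, but under the 2002 samplesort it could pass
for small inputs yet fail for large ones.)

```python
import random

def looks_stable(n):
    # Pairs of (key, original position); only 4 distinct keys,
    # so there are many ties for the sort to preserve (or not).
    data = [(random.randrange(4), i) for i in range(n)]
    data.sort(key=lambda pair: pair[0])  # sort on the key alone
    # Stable iff, within each run of equal keys, the original
    # positions are still increasing.
    return all(a[1] < b[1]
               for a, b in zip(data, data[1:])
               if a[0] == b[0])
```

A tiny test case passing tells you nothing about what happens once the
algorithm stops special-casing small lists.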

> and here
>
>     http://mail.python.org/pipermail/python-dev/2002-July/026854.html
>
> Tim seems <wink> to be arguing against stable sort as being the
> default due to code bloat.

I'm arguing there against having two highly complex and long-winded sorting
algorithms in the core.  Pick one.  In favor of samplesort:

+ It can be much faster in very-many-equal-elements cases (note that
  ~sort lists have only 4 distinct values, each repeated N/4 times
  and spread uniformly across the whole list).

+ While it requires some extra memory, that lives on the stack and
  is O(log N).  As a result, it can never raise MemoryError unless
  a comparison function does.

+ It's never had a bug reported against it (so is stable in a different
  sense <wink>).

In favor of timsort:

+ It's stable.

+ The code is more uniform and so potentially easier to grok, and
  because it has no random component is easier to predict (e.g., it's
  certain that it has no quadratic-time cases).

+ It's dramatically faster in the face of many kinds of mild
  disorder, which I believe are very common in the real world.  As
  obvious examples, you add an increment of new data to an already-
  sorted file, or paste together several sorted files.  timsort
  screams in those cases, but they may as well be random to
  samplesort, and the difference in runtime can easily exceed a
  factor of 10.  A factor of 10 is a rare and wonderful thing in
  algorithm development.
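
Both kinds of mild disorder are trivial to construct (a sketch, with sizes
chosen arbitrarily): timsort finds and merges the pre-existing sorted runs
in inputs like these, while to a samplesort they may as well be random.

```python
import random

# (1) An increment of new data appended to an already-sorted list.
base = sorted(random.random() for _ in range(100_000))
appended = base + [random.random() for _ in range(1_000)]

# (2) Several sorted lists pasted together.
runs = [sorted(random.random() for _ in range(20_000)) for _ in range(5)]
pasted = [x for run in runs for x in run]

appended.sort()
pasted.sort()
```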

Against timsort:

+ It can require O(N) temp storage, although the constant is small
  compared to object sizes.  That means it can raise MemoryError
  even if a comparison function never does.

+ Very-many-equal-elements cases can be much slower, but that's partly
  because it *is* stable, and preserving the order of equal elements
  is exactly what makes stability hard to achieve in a fast sort
  (samplesort can't be made stable efficiently).
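
What stability buys in practice (a sketch, again using the later key=
argument): with a stable sort you can build a multi-key ordering by
sorting on the secondary key first and then on the primary key, because
the second pass preserves the relative order of records with equal
primary keys.

```python
records = [("Smith", "Alice"), ("Jones", "Bob"),
           ("Smith", "Bob"), ("Jones", "Alice")]

# Secondary key first, then primary: stability guarantees the result
# is ordered by (last name, first name).
records.sort(key=lambda r: r[1])  # first name
records.sort(key=lambda r: r[0])  # last name
```

With an unstable sort, the second pass could scramble the first-name
ordering within each last name.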

> As Tim's Official Sysadmin, I'm only good at channeling him on one
> subject, albeit probably one he'd deem most important to his life:
> lunch.  So I'm not sure if he's arguing for or against stable sort
> being the default. ;)

All else being equal, a stable sort is a better choice.  Alas, all else
isn't equal.  If Python had no sort method now, I'd pick timsort with scant
hesitation.  Speaking of which, is it time for lunch yet <wink>?