[Python-Dev] Drastically improving list.sort() for lists of strings/ints

Terry Reedy tjreedy at udel.edu
Sun Sep 11 16:48:50 EDT 2016


On 9/11/2016 3:45 AM, Elliot Gorokhovsky wrote:
> Hello all,
>
> I am interested in making a non-trivial improvement to list.sort(),

This is non-trivial, and harder than you seem to think it is.

 > but
> before I put in the work, I want to test the waters and see if this is
> something the community would accept.

The debate on proposed enhancements is usually whether they are really 
enhancements, all things considered.  For special-case speedups, the 
'all things' include the frequency of the special cases, the ease of 
detecting them, the thoroughness of testing, and the maintainability of 
the proposed, and likely more complicated, code.

> Basically, I want to implement radix sort for lists of strings.

Radix sort was invented for fixed-length strings of digits, as in 
all-digit id 'numbers', so 10 bins.  ASCII strings need 128 bins, 
general byte strings need 256, still manageable.  General Unicode 
strings require 1,114,112 bins, most of which will be empty for most 
character positions.  This is harder to manage.  So are variable-length 
strings.
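
To make those bin counts concrete, here is a rough pure-Python sketch 
of an MSD radix sort over bytes objects using 256 bins; the name and 
structure are mine, not anything from CPython, and variable-length keys 
are handled by letting a shorter key sort before its extensions:

    def msd_radix_sort(keys, pos=0):
        # Most-significant-digit radix sort over bytes objects.
        if len(keys) <= 1:
            return keys
        done = []                           # keys exhausted at this position
        bins = [[] for _ in range(256)]     # one bin per possible byte value
        for k in keys:
            if len(k) <= pos:
                done.append(k)              # shorter key precedes its extensions
            else:
                bins[k[pos]].append(k)
        out = done
        for b in bins:
            out.extend(msd_radix_sort(b, pos + 1))
        return out

    print(msd_radix_sort([b"banana", b"apple", b"band", b"ban"]))
    # -> [b'apple', b'ban', b'banana', b'band']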

In CPython 3.3+, strings at the C level are not one uniform thing: they 
are stored with 1, 2, or 4 bytes per character.  So you could 
specifically target lists of bytes (and bytearrays) and lists of strings 
limited to 1-byte characters.  The same code should pretty much work 
for both.
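
As a rough Python-level illustration of that detection step (the real 
check would happen in C, e.g. via PyUnicode_KIND, not like this), a 
list would qualify if it is all bytes/bytearrays, or all strings whose 
code points fit in one byte:

    def fits_one_byte_radix(items):
        # All bytes-like, or all str whose characters fit in one byte.
        if all(isinstance(x, (bytes, bytearray)) for x in items):
            return True
        return all(isinstance(x, str) and (not x or max(x) < '\u0100')
                   for x in items)

    print(fits_one_byte_radix([b"spam", b"eggs"]))    # True
    print(fits_one_byte_radix(["egg", "caf\u00e9"]))  # True: e-acute is code point 0xE9
    print(fits_one_byte_radix(["egg", "\u20ac"]))     # False: the euro sign needs 2 bytes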

 > ...
> In-place radix sort is significantly faster for lexicographic sorting than
> Timsort (or in general any comparison-based sort, since radix can beat
> the nlogn barrier).

This unqualified statement is doubly wrong.

First, with respect to sorting in general: 'asymptotically faster' only 
means faster for 'large enough' tasks.  Many real-world tasks are small, 
and big tasks get broken down into multiple small tasks. 
'Asymptotically slower' algorithms may be faster for small tasks.  Tim 
Peters investigated and empirically determined that an O(n*n) binary 
insort, as he optimized it on real machines, is faster than O(n*logn) 
sorting for up to around 64 items.  So timsort uses binary insertion to 
sort up to 64 items.  Similarly, you would have to investigate whether 
there is a size below which timsort is faster.
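
For concreteness, that binary insertion sort looks roughly like this in 
Python (the real thing is C in Objects/listobject.c); the binary search 
keeps comparisons at O(n*logn) even though the element moves are 
O(n*n), and comparisons are the expensive part for Python objects:

    from bisect import bisect_right

    def binary_insertion_sort(a):
        # Insert each item into the already-sorted prefix a[:i],
        # locating its slot with a binary search, then shifting.
        for i in range(1, len(a)):
            x = a[i]
            j = bisect_right(a, x, 0, i)    # binary search in the sorted prefix
            a[j + 1:i + 1] = a[j:i]         # shift the tail right by one slot
            a[j] = x
        return a

    print(binary_insertion_sort([3, 1, 4, 1, 5, 9, 2, 6]))
    # -> [1, 1, 2, 3, 4, 5, 6, 9]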

Second, with respect to timsort in particular: timsort is designed to 
exploit structure and run faster than O(n*logn) in special cases.  If a 
list is already sorted, timsort will do one O(n) scan and stop.  Any 
radix sort will take several times longer.  If a list is reverse sorted, 
timsort will do one O(n) scan and do an O(n) reverse.  If a list is the 
concatenation of two sorted lists, timsort will find the two sorted 
sublists and merge them.  If a sorted list has unsorted items appended 
to the end, timsort will sort the appended items and then do a merge.  I 
expect any radix sort to be slower for all these cases.  Tim Peters 
documented his experiments and results with various special but 
plausible cases of non-randomness in Objects/listsort.txt in the 
CPython source.
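
One cheap way to see the already-sorted case at the Python level is to 
count comparisons with functools.cmp_to_key; timsort should need only 
about n-1 of them, one per adjacent pair, while any radix sort would 
still have to bucket every byte of every key:

    from functools import cmp_to_key

    def comparisons_used(data):
        count = 0
        def cmp(a, b):
            nonlocal count
            count += 1
            return (a > b) - (a < b)
        sorted(data, key=cmp_to_key(cmp))
        return count

    already_sorted = ["%06d" % i for i in range(10000)]
    print(comparisons_used(already_sorted))   # expect n-1 = 9999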

What you might propose is this:  if the initial up or down sequence 
already determined by timsort is less than, say, 64 items, and the 
length of the list is more than some empirically determined value, 
and some further checks determine that all items are bytes or 
1-byte-char strings, then switch to a radix sort.  But you should start 
by putting a module on PyPI.
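
To make that concrete, a pure-Python PyPI prototype of the dispatch 
might look roughly like this, reusing the msd_radix_sort sketch above 
and handling only the clear-cut bytes case; the two cutoff constants 
are placeholders to be measured, not anything from CPython:

    SIZE_CUTOFF = 1000   # hypothetical; to be determined empirically
    RUN_CUTOFF = 64      # hypothetical; roughly timsort's minrun territory

    def initial_run_length(a):
        # Length of the leading non-decreasing run, as timsort's count_run finds it.
        i = 1
        while i < len(a) and a[i - 1] <= a[i]:
            i += 1
        return i

    def smart_sort(items):
        # Dispatch to the radix sketch only when the special case clearly
        # applies; everything else falls back to timsort via sorted().
        if (len(items) > SIZE_CUTOFF
                and all(isinstance(x, bytes) for x in items)
                and initial_run_length(items) < RUN_CUTOFF):
            return msd_radix_sort(items)
        return sorted(items)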

-- 
Terry Jan Reedy


