[Python-Dev] Python3 regret about deleting list.sort(cmp=...)

Terry Reedy tjreedy at udel.edu
Sun Mar 13 23:45:27 CET 2011


On 3/13/2011 2:05 PM, Daniel Stutzbach wrote:
> On Sat, Mar 12, 2011 at 3:44 PM, Guido van Rossum
> <guido at python.org> wrote:
>
>     I recently advised a Googler who was sorting a large dataset and
>     running out of memory. My analysis of the situation was that he was
>     sorting a huge list of short lines of the form "shortstring,integer"
>     with a key function that returned a tuple of the form ("shortstring",
>     integer).
>
>
> As Raymond pointed out, a change I made for 3.2 significantly shrinks
> the memory footprint of sorting with a key (although it's still more
> memory-intensive than sorting with cmp).
>
> He could reduce the memory footprint further by sorting in two passes
> instead of using a tuple, leveraging the fact that Python guarantees a
> stable sort.  In 3.2 or later, this technique will require roughly twice
> as much memory as just storing the list:
>
> biglist.sort(key=lambda s: int(s.split(',')[1]))  # Sort by the integer
> biglist.sort(key=lambda s: s.split(',')[0])  # Sort by the shortstring
>
> I think the use cases are pretty narrow where there's plenty of memory
> for storing the list but not enough to store two copies.

Here are two variations on the idea of two-pass sorting, both of which 
sort ints only for duplicate shortstrings, and neither of which uses key=.

1. If duplication of shortstrings is (known to be) sparse, sort the 
lines as they are, then scan for runs (slices) of duplicate 
shortstrings and sort and replace just those runs (see the first 
sketch below). (If sort had start and stop index keywords, the runs 
could be re-sorted in place.)

2. If duplication of shortstrings is (known to be) heavy, read the 
dataset into a shortstring-to-list-of-ints dict rather than a list. 
Sort the keys in a separate (much shorter) list and sort each value 
(list of ints) separately (see the second sketch below). For some 
post-sort usages, this would be more useful than a flat sorted list.
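
To make variation 1 concrete, here is a rough sketch, assuming the same 
'biglist' of "shortstring,integer" lines from Guido's example; the 
run-scanning loop and the line rebuilding are my own illustration and 
assume the integer field has no leading zeros or spaces:

biglist.sort()   # plain lexicographic sort: no key=, no decoration

i = 0
n = len(biglist)
while i < n:
    prefix = biglist[i].split(',')[0]
    j = i + 1
    while j < n and biglist[j].split(',')[0] == prefix:
        j += 1
    if j > i + 1:
        # Only the duplicate run gets its ints pulled out, sorted as
        # plain ints (again no key=), and written back.  Rebuilding
        # with '%s,%d' assumes no leading zeros/spaces in that field.
        ints = sorted(int(line.split(',')[1]) for line in biglist[i:j])
        biglist[i:j] = ['%s,%d' % (prefix, k) for k in ints]
    i = j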
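
And a rough sketch of variation 2; 'lines' is just a stand-in for 
however the raw dataset is read:

from collections import defaultdict

table = defaultdict(list)
for line in lines:
    short, _, num = line.partition(',')
    table[short].append(int(num))

keys = sorted(table)        # the much shorter list of shortstrings
for short in keys:
    table[short].sort()     # each value is a plain list of ints; no key=

# If a flat sorted list is still wanted afterwards:
flat = ['%s,%d' % (short, k) for short in keys for k in table[short]]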

A third idea is to pre-partition the dataset into chunks (a-m, n-z, or 
whatever division is appropriate), perhaps as it is gathered; sort each 
chunk separately; and then concatenate the results. This handles cases 
where even the flat list is too big for memory.
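
A minimal sketch of this, spooling each chunk to its own temporary file 
while the data is gathered, so only one chunk needs to be in memory at 
a time; read_dataset() and the file names are placeholders of mine:

buckets = ('a-m', 'n-z')
files = {b: open('chunk_%s.txt' % b, 'w') for b in buckets}
for line in read_dataset():        # assumed to yield stripped lines
    b = 'a-m' if line[:1].lower() <= 'm' else 'n-z'
    files[b].write(line + '\n')
for f in files.values():
    f.close()

with open('sorted_output.txt', 'w') as out:
    for b in buckets:              # sort and emit one chunk at a time
        with open('chunk_%s.txt' % b) as f:
            chunk = f.read().splitlines()
        if chunk:
            chunk.sort(key=lambda s: (s.split(',')[0],
                                      int(s.split(',')[1])))
            out.write('\n'.join(chunk) + '\n')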

-- 
Terry Jan Reedy


