Feature request: sorting a list slice

Thu May 18 18:51:45 EDT 2006

Am Donnerstag 18 Mai 2006 22:13 schrieb Raymond Hettinger:
> This is a false optimization.  The slicing steps are O(n) and the sort
> step is O(n log n) unless the data has some internal structure that
> Timsort can use to get closer to O(n).
>
> If you implemented this and timed in it real apps, I would be surprised
> to find that the slice extraction and assignments were the performance
> bottleneck.  IMO, a worthwhile performance gain is unlikely.

I personally wouldn't find this to be a false optimization, at least not 
memory-wise. If you sort really large lists, and only want to sort part of 
the list, the memory gains of not having to do a slice should be enormous, 
and there should be some time-gains too. And, additionally, implementing this 
(if I understand timsort correctly, just had a look at the sources) isn't 
hard.

Rather, I'd pose the usage question: why are you only sorting a part of a 
list? lists are basically meant for homogenous data. And sorting only a part 
of that should be a pretty meaningless operation, mostly...

Anyway, I've just written up a patch to extend sort() with a start/stop 
parameter (which behaves just like the default slice notation). Generally, 
this will make sort() setup slightly slower in the "standard" case (because 
there's some more pointer arithmetic involved, but this should be negligible, 
and is O(1)), but for the actual use case, the following numbers can be seen:

modelnine at phoenix ~/mercurial/python-modelnine $ ./python test.py
New average time: 13.7254650593 ms per loop
Old average time: 14.839854002 ms per loop
[10198 refs]
modelnine at phoenix ~/mercurial/python-modelnine $

This is just a very, very simple test (testing the case of a list with one 
million entries, where 99000 are sorted):

>>>
from random import randrange
from time import time

x = [randrange(256) for x in xrange(1000000)]
timesnew, timesold = [], []

for reps in range(1000):
    y = x[:]
    timesnew.append(time())
    y.sort(start=1000,stop=10000)
    timesnew[-1] = time() - timesnew[-1]

for reps in range(1000):
    y = x[:]
    timesold.append(time())
    y[1000:10000] = sorted(y[1000:10000])
    timesold[-1] = time() - timesold[-1]

print "New average time:", sum(timesnew), "ms per loop"
print "Old average time:", sum(timesold), "ms per loop"
>>>

I'll post the patch to the SF bugtracker tomorrow, it's too late today for me 
to review it, but generally, I wouldn't find it to be a bad addition, if 
there's actually a general use case to only sort parts of a list.

--- Heiko.