[Tutor] how to keep track of sorted lists

Steven D'Aprano steve at pearwood.info
Sat Nov 3 16:18:22 CET 2012


On 04/11/12 00:04, Albert-Jan Roskam wrote:
> Hello,
>
> I want to make a get() method that uses a binary search. This requires
>  the data to be sorted which is an expensive operation, so I would like
>to do this no more often than needed.

Reset your intuition. Sorting an almost-sorted list with Timsort (Python's
sort algorithm) is blindingly fast, approaching O(N), but running at full
C speed rather than pure Python. You may find that the easiest, fastest,
most efficient way to add data to a list and keep it sorted is to simply
append at the end and then sort.

class SortedSequence(object):
     def __init__(self, sequence):
         self.data = sorted(sequence)
     def insert(self, item):
         self.data.append(item)
         self.data.sort()


Another approach is to sort only on demand: tag the instance as "dirty",
and then whenever you do a lookup, if the list is dirty, sort first.

class SortedSequence(object):
     def __init__(self, sequence):
         self.data = sorted(sequence)
         self.dirty = False
     def insert(self, item):
         self.data.append(item)
         self.dirty = True
     def get(self, key):
         if self.dirty:
             self.data.sort()
             self.dirty = False
         # now use bisect to find the key


A third approach: always use the bisect algorithm to insert the item in
the right place.


class SortedSequence(object):
     def __init__(self, sequence):
         self.data = sorted(sequence)
     def insert(self, item):
         # use bisect in here...
     def get(self, key):
         # and in here too...


You may find that a hybrid approach is best, say using version 1 for
data lists with less than a million items, and version 3 for bigger
ones. Don't guess, measure and find out.

Whatever you do, make sure you test with BIG lists -- for small lists,
*any* approach will likely be fast enough.


Now that you have a SortedSequence type, you can easily deal with many
of them:

mapping = {}  # map param to data

mapping['x'] = SortedSequence(lots of data)
mapping['y'] = SortedSequence(more data)

mapping['x'].insert(item)
mapping['y'].get(key)


sort of thing.


One final thought... why are you using binary search instead of a dict?
Dict lookups will generally be much faster than binary search, O(1)
compared to O(log N), and simpler too. For large N, the difference
between constant-time lookups and those proportional to log N can be
substantial.


> Which of the to classes below is the best way to keep track of the
>  sorted lists?


Neither.




-- 
Steven


More information about the Tutor mailing list