[Tutor] how to keep track of sorted lists
Steven D'Aprano
steve at pearwood.info
Sat Nov 3 16:18:22 CET 2012
On 04/11/12 00:04, Albert-Jan Roskam wrote:
> Hello,
>
> I want to make a get() method that uses a binary search. This requires
> the data to be sorted which is an expensive operation, so I would like
>to do this no more often than needed.
Reset your intuition. Sorting an almost-sorted list with Timsort (Python's
sort algorithm) is blindingly fast, approaching O(N), but running at full
C speed rather than pure Python. You may find that the easiest, fastest,
most efficient way to add data to a list and keep it sorted is to simply
append at the end and then sort.
class SortedSequence(object):
def __init__(self, sequence):
self.data = sorted(sequence)
def insert(self, item):
self.data.append(item)
self.data.sort()
Another approach is to sort only on demand: tag the instance as "dirty",
and then whenever you do a lookup, if the list is dirty, sort first.
class SortedSequence(object):
def __init__(self, sequence):
self.data = sorted(sequence)
self.dirty = False
def insert(self, item):
self.data.append(item)
self.dirty = True
def get(self, key):
if self.dirty:
self.data.sort()
self.dirty = False
# now use bisect to find the key
A third approach: always use the bisect algorithm to insert the item in
the right place.
class SortedSequence(object):
def __init__(self, sequence):
self.data = sorted(sequence)
def insert(self, item):
# use bisect in here...
def get(self, key):
# and in here too...
You may find that a hybrid approach is best, say using version 1 for
data lists with less than a million items, and version 3 for bigger
ones. Don't guess, measure and find out.
Whatever you do, make sure you test with BIG lists -- for small lists,
*any* approach will likely be fast enough.
Now that you have a SortedSequence type, you can easily deal with many
of them:
mapping = {} # map param to data
mapping['x'] = SortedSequence(lots of data)
mapping['y'] = SortedSequence(more data)
mapping['x'].insert(item)
mapping['y'].get(key)
sort of thing.
One final thought... why are you using binary search instead of a dict?
Dict lookups will generally be much faster than binary search, O(1)
compared to O(log N), and simpler too. For large N, the difference
between constant-time lookups and those proportional to log N can be
substantial.
> Which of the to classes below is the best way to keep track of the
> sorted lists?
Neither.
--
Steven
More information about the Tutor
mailing list