best way to determine sequence ordering?

nikie n.estner at gmx.de
Sat Apr 29 09:05:35 EDT 2006


Steven Bethard wrote:

> nikie wrote:
> > Steven Bethard wrote:
> >
> >> John Salerno wrote:
> >>> If I want to make a list of four items, e.g. L = ['C', 'A', 'D', 'B'],
> >>> and then figure out if a certain element precedes another element, what
> >>> would be the best way to do that?
> >>>
> >>> Looking at the built-in list functions, I thought I could do something
> >>> like:
> >>>
> >>> if L.index('A') < L.index('D'):
> >>>     # do some stuff
> >> This is probably a pretty reasonable approach as long as you're not too
> >> worried about efficiency.  It scans the list twice though, so if you
> >> start doing this with really big lists, it might be better to do
> >> something like:
> >
> > On average, it'll only have to go through half the list twice, as it can
> > break as soon as it finds the item it searches for. Don't think you can
> > get any more efficient than that.
>
> Sure you can:
>
>      a_index = None
>      d_index = None
>      for i, item in enumerate(L):
>          if item == 'A':
>              a_index = i
>          elif item == 'D':
>              d_index = i
>          if a_index is not None and d_index is not None:
>              break
>
>      if a_index < d_index:
>          # do some stuff
>
> That goes through exactly the minimum number of elements.  But it's
> probably slower than two .index() calls since .index() is coded in C.
> But timeit and see if you're interested.

On my PC, calling index twice is 5 times faster.
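For reference, a measurement along these lines can be set up with the timeit module (a sketch of my own, not code from the thread; the function names and the repeat count are arbitrary choices):

```python
import timeit

L = ['C', 'A', 'D', 'B']

def with_index():
    # two .index() calls, each a linear scan coded in C
    return L.index('A') < L.index('D')

def with_loop():
    # a single interpreted pass that tracks both positions
    a_index = d_index = None
    for i, item in enumerate(L):
        if item == 'A':
            a_index = i
        elif item == 'D':
            d_index = i
        if a_index is not None and d_index is not None:
            break
    return a_index < d_index

# both answer the same ordering question; only the timings differ
print(timeit.timeit(with_index, number=100000))
print(timeit.timeit(with_loop, number=100000))
```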

But let's just pretend for a moment that Python code were compiled and
optimized as well as C code. Even then, this function would still not be
faster than calling index twice. You seem to be under the impression
that "looping through a list" takes a lot of time and "comparing items"
takes none. The reverse is the case: looping through a list means the
processor has to do something like "increment an index" once for every
item in the list (for optimized C code, that is! Looping through a
generator like the one returned by enumerate in interpreted code is a
lot more complex). Comparing two items, on the other hand, involves (as
lists aren't statically typed) looking up a cmp-method dynamically,
calling it dynamically, and, of course, a string comparison, which
again involves two pointer lookups for every character that has to be
compared.

So, if you want to know how fast your function will be, you'll have to
count the number of comparisons. (Of course, we're ignoring the fact
that smaller functions tend to run quicker due to cache, branch-
prediction and optimization effects.) If you call your function with
the list ['C', 'A', 'D', 'B'], it will compare the first item 'C' to
'A' and 'D', then the second, 'A', to 'A', and the third, 'D', to 'A'
and 'D', so it'll have to do 5 comparisons, correct? If you call index
to find the first occurrence of the item 'A' in the same list, it will
have to do 2 comparisons (it can break as soon as it finds the item),
and 3 comparisons are needed for finding the item 'D', so it'll do 5
comparisons, too. Plus, you have a small overhead for comparing
"a_index" and "d_index" to None (which is faster than a string
comparison, but still takes time, probably more time than incrementing
an index for looping through a list).

Things get even worse if 'A' and 'D' aren't "neighbors" in the list,
because then all the items between 'A' and 'D' will have to be compared
to both 'A' and 'D' in the version above, but only to 'D' if you call
index twice. So, the function above isn't only less readable, it's also
slower in the average case.

You might, however, be able to tweak the performance of the index
version a little bit if you call it only on small chunks of the list at
a time, using the form of index() that takes start and stop indices, so
that each part of the list only has to be moved from memory into the
cache once. But I'm not sure the performance is memory-bound at all
(that's always hard to tell in an interpreted language).
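A chunked variant along those lines might look like this (a hypothetical sketch of my own, not code from the thread; the helper name and the chunk size are arbitrary):

```python
def ordered_before(L, a, b, chunk=4096):
    # Walk the list one chunk at a time, probing for both items inside
    # the current chunk via index(value, start, stop), so each slice of
    # the list is pulled into the cache only once.
    a_pos = b_pos = None
    for start in range(0, len(L), chunk):
        stop = start + chunk
        if a_pos is None:
            try:
                a_pos = L.index(a, start, stop)
            except ValueError:
                pass  # a is not in this chunk; keep scanning
        if b_pos is None:
            try:
                b_pos = L.index(b, start, stop)
            except ValueError:
                pass
        if a_pos is not None and b_pos is not None:
            return a_pos < b_pos
    raise ValueError("item not found")
```

Whether this actually beats two plain index() calls depends on the list being large enough that cache behavior dominates, which would need measuring.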

> >
> >>  >>> L = ['C', 'A', 'D', 'B']
> >>  >>> positions = dict((item, i) for i, item in enumerate(L))
> >>  >>> positions
> >> {'A': 1, 'C': 0, 'B': 3, 'D': 2}
> >>  >>> positions['A'] < positions['D']
> >> True
> >
> > Isn't this bound to be less efficient? I mean, inserting the items into
> > a dict is probably O(n log n)
>
> No, it's O(N).  Dicts are just hash tables.  Insertion is O(1).  So N
> insertions is O(N).

I've measured that with the timeit module. The time it takes to build a
dict with N entries doesn't seem to be proportional to N: (t-c)/N keeps
increasing. Try this:

import timeit, math

def time(N):
    return timeit.Timer(
        "dict([('%%010i'%%i,i) for i in range(%i)])" % N
    ).timeit(number=5)

c = time(1000)*2-time(2000)

for i in range(1000,1000000,1000):
    t = time(i)
    print "%5i -> %f (t/n = %f)" % (i,t, (t-c)/i*1000)

> This route is also dramatically more efficient if you need to make M
> comparisons between items instead of just 1 comparison.  In that case,
> using the dict is O(N + M) instead of the O(N * M) of the .index() route.

Assuming (as I have said already) that you build the dict once, but
call index again and again for every comparison, which is of course
comparing apples with oranges.
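To make the comparison concrete, the apples-to-apples setup looks something like this (a sketch of my own; the query list is made up for illustration):

```python
L = ['C', 'A', 'D', 'B']
queries = [('A', 'D'), ('B', 'C'), ('C', 'D')]  # M ordering questions

# dict route: one O(N) pass up front, then O(1) per query
positions = dict((item, i) for i, item in enumerate(L))
dict_answers = [positions[x] < positions[y] for x, y in queries]

# index route: two linear scans per query, O(N) each time
index_answers = [L.index(x) < L.index(y) for x, y in queries]

print(dict_answers == index_answers)  # same answers either way
```

The dict only pays off if the same list really is queried many times; for a single comparison, the two index() calls do strictly less work.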




More information about the Python-list mailing list