list index()

Wed Sep 5 00:31:57 EDT 2007

Neil Cerutti <horpner at yahoo.com> wrote:

> It's probable that a simpler implementation using slice
> operations will be faster for shortish lengths of subseq. It was
> certainly easier to get it working correctly. ;)
> 
> def find(seq, subseq):
>   for i, j in itertools.izip(xrange(len(seq)-len(subseq)),
>                              xrange(len(subseq), len(seq))):
>     if subseq == seq[i:j]:
>       return i
>   return -1

Simpler yet (though maybe slower!-):

def find(seq, subseq):
    L = len(subseq)
    for i in xrange(0, len(seq)-L):
        if subseq == seq[i:i+L]: return i
    return -1

also worth trying (may be faster in some cases, e.g. when the first item
of the subsequence occurs rarely in the sequence):

def find(seq, subseq):
    L = len(subseq)
    firstitem = subseq[0]
    end = len(seq) - len(subseq)
    i = -1
    while 1:
        try: i = seq.index(firstitem, i+1, end)
        except ValueError: return -1
        if subseq == seq[i:i+L]: return i

For particularly long sequences (with hashable items) it might even be
worth trying variants of Boyer-Moore, Horspool, or Knuth-Morris-Pratt;
while these search algorithms are mostly intended for text strings,
since you need tables indexed by the item values, using dicts for such
tables might yet be feasible (however, the program won't be quite that
simple).  Benchmarking of various possibilities on typical input data
for your application is recommended, as performance may vary!

Alex